{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# NLTK"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "NLTK is a leading platform for building Python programs that work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and an active discussion forum.\n",
    "\n",
    "Library documentation: <a href=\"http://www.nltk.org/\">http://www.nltk.org/</a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# render matplotlib figures inline in the notebook\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "showing info http://nltk.github.com/nltk_data/\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
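   ],
   "source": [
    "# import the library and open the NLTK downloader to fetch the sample texts\n",
    "# (nltk.download('book') fetches only the collection used by nltk.book)\n",
    "import nltk\n",
    "nltk.download()"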
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "*** Introductory Examples for the NLTK Book ***\n",
      "Loading text1, ..., text9 and sent1, ..., sent9\n",
      "Type the name of the text or sentence to view it.\n",
      "Type: 'texts()' or 'sents()' to list the materials.\n",
      "text1: Moby Dick by Herman Melville 1851\n",
      "text2: Sense and Sensibility by Jane Austen 1811\n",
      "text3: The Book of Genesis\n",
      "text4: Inaugural Address Corpus\n",
      "text5: Chat Corpus\n",
      "text6: Monty Python and the Holy Grail\n",
      "text7: Wall Street Journal\n",
      "text8: Personals Corpus\n",
      "text9: The Man Who Was Thursday by G . K . Chesterton 1908\n"
     ]
    }
   ],
   "source": [
    "# load the sample texts (text1 to text9) and sentences (sent1 to sent9)\n",
    "from nltk.book import *"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Displaying 11 of 11 matches:\n",
      "ong the former , one was of a most monstrous size . ... This came towards us , \n",
      "ON OF THE PSALMS . \" Touching that monstrous bulk of the whale or ork we have r\n",
      "ll over with a heathenish array of monstrous clubs and spears . Some were thick\n",
      "d as you gazed , and wondered what monstrous cannibal and savage could ever hav\n",
      "that has survived the flood ; most monstrous and most mountainous ! That Himmal\n",
      "they might scout at Moby Dick as a monstrous fable , or still worse and more de\n",
      "th of Radney .'\" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l\n",
      "ing Scenes . In connexion with the monstrous pictures of whales , I am strongly\n",
      "ere to enter upon those still more monstrous stories of them which are to be fo\n",
      "ght have been rummaged out of this monstrous cabinet there is no telling . But \n",
      "of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u\n"
     ]
    }
   ],
   "source": [
    "# show every occurrence of a word together with its surrounding context\n",
    "text1.concordance(\"monstrous\")"
   ]
  },
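  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The concordance above is a keyword-in-context listing. As a rough illustration of the idea only (this is not NLTK's implementation), a minimal pure-Python sketch might look like the following. The function name `kwic` and its `width` parameter are hypothetical, and the text is assumed to be a list of tokens already.\n",
    "\n",
    "```python\n",
    "def kwic(tokens, word, width=3):\n",
    "    # hypothetical helper: collect (left context, match, right context)\n",
    "    # for every case-insensitive occurrence of word in tokens\n",
    "    lines = []\n",
    "    for i, tok in enumerate(tokens):\n",
    "        if tok.lower() == word.lower():\n",
    "            left = ' '.join(tokens[max(0, i - width):i])\n",
    "            right = ' '.join(tokens[i + 1:i + 1 + width])\n",
    "            lines.append((left, tok, right))\n",
    "    return lines\n",
    "\n",
    "sample = 'a most monstrous size and a monstrous fable'.split()\n",
    "for left, hit, right in kwic(sample, 'monstrous', width=2):\n",
    "    print('%s [%s] %s' % (left, hit, right))\n",
    "```\n",
    "\n",
    "The sketch counts context in whole tokens, whereas the NLTK output above is trimmed to a fixed character width; counting tokens simply keeps the example short."
   ]
  },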
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "imperial subtly impalpable pitiable curious abundant perilous\n",
      "trustworthy untoward singular lamentable few determined maddens\n",
      "horrible tyrannical lazy mystifying christian exasperate\n"
     ]
    }
   ],
   "source": [
    "# list words that appear in contexts similar to those of the given word\n",
    "text1.similar(\"monstrous\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "a_pretty is_pretty a_lucky am_glad be_glad\n"
     ]
    }
   ],
   "source": [
    "# contexts (word before, word after) shared by both words\n",
    "text2.common_contexts([\"monstrous\", \"very\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "image/png": [
       "iVBORw0KGgoAAAANSUhEUgAAAakAAAEZCAYAAAAt5touAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\n",
       "AAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xu4HFWZ7/HvDwJyJwnwiCIYEJWAkWBQLhPIDjDeTsDk\n",
       "iAIKKp4z4mhEHWYAxRkSPTpRZw5BFJhxRgRERQUzEB1umo5yEwIhBAhogKCAIEFuotzf+aNWpWvX\n",
       "7t637N577eT3eZ56unrVury1+vJ2VfXerYjAzMwsRxuMdABmZmbtOEmZmVm2nKTMzCxbTlJmZpYt\n",
       "JykzM8uWk5SZmWXLScrWG5IOkHTnEPSzStLBa9H+/ZIuX9s4hspQzcsgxn1J0i7DPa6NLk5Slq21\n",
       "TQZ1EfHLiNhtKLpKSw+Svi3pWUlPpmW5pC9J2qoSxwUR8bYhiGNIDOG8dCNpQkpET6XlXkknDaKf\n",
       "D0n65VDHZ6ODk5TlrG0yyFgAX46IrYBtgWOBfYFrJG02UkFJGsnX+tYRsSVwFPBPkt46grHYKOMk\n",
       "ZaOOCidLWilptaQLJY1L286S9KNK3S9Luiqtd0n6XWXbjpIulvSH1M8Zqfw1kn6eyh6R9B1JWw8k\n",
       "RICIeC4ilgCHAdtQJKxuRwZpX06T9LCkJyTdKmn3tO3bks6WdEU6KmtI2qkS/26SrpT0qKQ7Jb2n\n",
       "su3baS5+KulPQJekd0q6I/V1v6QT2szLxDTWY5Juk3Rord9vSFqY+rm+v6fsIuJ64HbgDT0mTNpa\n",
       "0nnpsVgl6ZQ0NxOBs4D90tHYH/v7INi6wUnKRqPjKd74DwReATwGfCNt+ztgkqQPSjoA+DDwgXoH\n",
       "kjYEFgL3Aq8GdgC+X6nyxdT3RGBHYM5gg42IPwFXAge02PzWVP7aiNgaeA9QfSN+H/B5iqOyW4AL\n",
       "Uvybpz6/A2wHHAmcmd7US0cBX4iILYBrgf8E/iYd5e0B/LwejKSNgEuBy1K/nwAukPS6SrUjKOZj\n",
       "HLCSYq56k/KN/iqNu7RFnTOALYGdgWkUj9mxEbEC+ChwXURsGRHj+xjL1jFOUjYaHQd8LiIejIjn\n",
       "gbnA4ZI2iIi/AMcApwHnA7Mj4sEWfbyFIgn9Q0T8JSKejYhrACLi7oj4WUQ8HxGrU1/T1jLm3wOt\n",
       "3mCfp3hznpjivysiHqpsXxgRV0fEc8ApFEcUrwJmAPdGxLkR8VJE3AJcTJHkSgsi4rq0T88AzwF7\n",
       "SNoqIp6IiFbJYl9g84iYFxEvRMQiimR+VKXOxRGxJCJepEiak/vY99XAo8A3gZNSn2ukDwxHAJ+J\n",
       "iKcj4j7gXykeR0hHprZ+cpKy0WgC8ON0Ouox4A7gBeDlABFxA3BPqvvDNn3sCNwXES/VN0h6uaTv\n",
       "p1NiT1Aku23WMuYdKN6ou4mInwNfpzgSfFjSv0nastwM3F+p+zTFUdYrKY7+9innIM3D+0hzkNqu\n",
       "OYWXvBt4J7Aqnc7bt0Wcr2zR7r5UXvb7cGXbX4At2u51YZuIGB8Ru0fE11ts3xbYKI1T+i3FnNl6\n",
       "zknKRqPfAm+PiHGVZbOI+D2ApI8DGwMPAie26eN3wE7pU3zdl4AXgTekU3DHMLDXSrcve0jaAjgE\n",
       "aPkNtYg4IyL2BnYHXgf8Q9mUIplW+xkPPEAxB4trc7BlRHy8bVDF0c9MitN4C4AftKj2ILCjpOrR\n",
       "y6vTmJ2ymuKIckKlbCeaCXq0fXnGhpCTlOVuY0mbVJYxwNnAl8ovEUjaTtJhaf11wBeA91Nc1zhR\n",
       "0p4t+r2B4hTcPEmbpb73T9u2AJ4GnpS0A82k0R9KC5JeJmkKRUJ4FDinR2Vpb0n7pGtBfwaeoUiQ\n",
       "pXdK+itJG6f9ui4iHgB+ArxO0tGSNkrLmyWVXyVXbZyNVPx91tbpNN1TtXFKv0pxnJjadFGcWiyv\n",
       "1w35qbcUzw+AL0raQtKrgU9TXG+D4sjtVWmObD3jJGW5+ynFm2a5/BNwOnAJcIWkJ4HrgLeko6Lz\n",
       "gXkRsTwiVgKfBc6vvMEFrHljPBTYleKo5HfAe1OducCbgCcovkRwEf3/NB8Ub/BPUhwhnAvcCOyf\n",
       "rpeVdcr+tgL+neI03qrU5quVet8FTqVIcnsBR6f4n6L40sWRFEc5vwf+meIIsj5G6Wjg3nQK8yMU\n",
       "ibwaN+na16HAO4BHKE5FHhMRv+6l397mpr/bPkHxweAeiiPOC2gm9Z9RfCvwIUl/6KU/WwfJP3po\n",
       "lidJ5wD3R8Q/jnQsZiPFR1Jm+fK32my95yRllq/R+B83zIaUT/eZmVm2fCRlZmbZGjPSAeRCkg8p\n",
       "zcwGISI6dv3UR1IVEZH9cuqpp454DOtCjI7Tcea+jJY4O81JyszMsuUkZWZm2XKSGmW6urpGOoQ+\n",
       "jYYYwXEONcc5tEZLnJ3mr6AnksJzYWY2MJIIf3HCzMzWR05SZmaWLScpMzPLlpOUmZlly0nKzMyy\n",
       "5SRlZmbZcpIyM7NsOUmZmVm2nKTMzCxbTlJmZpYtJykzM8uWk5SZmWXLScrMzLLlJGVmZtlykjIz\n",
       "s2w5SZmZWbacpMzMLFtOUmZmli0nKTMzy5aTlJmZZctJyszMsuUkZWZm2XKSMjOzbDlJmZlZtpyk\n",
       "zMwsW05SZmaWrRFLUhLHSRyT1j8k8YrKtm9KTByp2MzMLA+KiJGOAYlFwN9HcNPIxaDIYS7MzEYT\n",
       "SUSEOtX/sB1JSXxAYpnELRLnSZwqcYLEu4G9gQskbpbYRKIhMUXiUImlablL4p7U15RUZ4nEZRLb\n",
       "p/KGxDyJX6X6U1P5HqlsaYph11YxNhrNBWD+/O5l5TJ7dvN2/vxmvfnzu/dVLa+2r96HZj/1cer1\n",
       "Zs1qPbf1mEtl/XJbff/qbat1W5k1q4h10qTithxv1qzW+1+WlftR1q/HUx+7WlatD825r96vt2m3\n",
       "D/U6ZWzVWNrFVo+7HLfed3UOquWtnkfVvgcyL632rz5Pvc1Hu3HqWrXrrT4Uz43+PK69jVOqz2W7\n",
       "8lbtq2Xl87O/9Vtt621/2r3+W/XR6rVXfx8on1tTp/Z8PyjHW58MS5KS2AM4BZgewWTgk2lTRHAR\n",
       "sAR4XwRviuAZINK2SyPYK4K9gFuAr0qMAc4A3h3B3sA5wBfL/oANI9gH+BRwair/KHB66mcKcH+r\n",
       "OOtPmgULWr+5LFzYvF2woFlvwYLufVXLq+2r96HZT32cer1Fi1rPbz3mUll/qJLUokVFrCtWFLfl\n",
       "eIsWtd7/sqzcj7L+QJJUtT405756v96mv0mqjK0aS7vY6nGX49b7rs5Btby3JFWfo8Ekqfo8jVSS\n",
       "WrFi6JJUfS7blfeVdMrnZ3/rt9rW2/60e/236qO3JFW+D5TPrSVLer4flOOtT8YM0zgHAT+I4I8A\n",
       "ETymngeHbQ8XJU4E/hzBWRJvAPYArkp9bAg8WKl+cbq9GZiQ1q8FTpF4FXBxBCvXZmfMzGx4DFeS\n",
       "CnpJQpU6PUgcArwbOLAsAm6PYP82/Tybbl8k7V8E35O4HpgB/FTiuAh6HJc0GnMq611AVx8hm5mt\n",
       "XxqNBo2+DqmH0HAlqZ8DP5b4/xH8UWJ8Ki8T11PAVvVGEq8GvgG8NWJN8rkL2E5i3wiul9gIeG0E\n",
       "d7QbXGKXCO4BzpDYCZgEPZNUV9ecyvoA99DMbD3Q1dVFV+UNcu7cuR0db1iSVAR3SHwRWCzxIrAU\n",
       "WEXz6OnbwNkSf4Y1R0gCPgiMBxakU3sPRDBD4nDgaxJbp304DVomqbL/90ocDTwP/J7mNSwzM8vY\n",
       "cB1JEcF5wHlttl1M81oSwPR0exPw+Rb1lwHTWpRPr6yvBnZJ6/OAeX3FWD96mjkTJk/uWW/16qLu\n",
       "6tWwa/qe4OTJMHZs977Gjm2WV/up358xo+inPla93vTptFSNe+bMnvXL7a2ODutlvR1BTp8OO+wA\n",
       "ixfDtGnN8caNa+5vvZ+yrNyP6py2G7u+P/W5qm6fMaNnm3b7UK/TKt529+txr17dun51/qvlvc1r\n",
       "2aa/89Kqr/o89TYf/Ympt3a9mThx4OO1K6/PZbvyvp7X06f3/fzobd/62p9y7gfyfGpVVr4PrExX\n",
       "zPfeu3udsv9287KuyuLvpHLgv5MyMxu4debvpMzMzAbKScrMzLLlJGVmZtlykjIzs2w5SZmZWbac\n",
       "pMzMLFtOUmZmli0nKTMzy5aTlJmZZctJyszMsuUkZWZm2XKSMjOzbDlJmZlZtpykzMwsW05SZmaW\n",
       "LScpMzPLlpOUmZlly0nKzMyy5SRlZmbZcpIyM7NsOUmZmVm2nKTMzCxbTlJmZpYtJykzM8uWk5SZ\n",
       "mWXLScrMzLLlJGVmZtlykjIzs2wNKElJzJE4oVPBmJmZVQ30SCo6EkU/SYwZyfFz0mgMfV8D7bPR\n",
       "GNo4bGTNnz/0j2m7vsryqVOb486fX9yvxlKuV2PsRDxDqdU8zp7dLKsvZVz112G1/fr8OuszSUmc\n",
       "InGXxC+B16ey10j8t8QSiV9Ia8q/LXGmxHUSd0t0SZwrcYfEOZU+j5K4VWK5xLxK+dslbpK4ReLK\n",
       "VDZH4nyJq4FzJV6dxrwpLftV2p+U+r1F4ksSu0jcVNn+2ur90cxJyobaggXDn6SWLGmOu2BBcb8a\n",
       "S7lejbET8QylVvO4cKGT1GD1emQiMQU4AtgT2Ai4GbgJ+DfgoxGslNgHOBM4ODUbG8F+EocBlwD7\n",
       "AXcAN0rsCTwCzAPeBDwOXCHxLuBa4N+BAyK4T2JsJZTdgKkRPCuxKfDXaf21wHeBN0u8AzgMeEsE\n",
       "z0iMjeBxiSck9oxgGXAs8K21mjEzMxs2fZ0+OwC4OIJngGckLgE2AfYHfiitqbdxug3g0rR+G/BQ\n",
       "BLcDSNwOTEhLI4JHU/kFwIHAi8AvIrgPIILHK31eEsGzlbG+nhLei8BrU/khwLdSrNX2/wEcK/F3\n",
       "wHuBN7fb2Tlz5qxZ7+rqoqurq4/pMTNbvzQaDRrDeGjXV5IKQLWyDYDHI9irTZvn0u1LsCaxlPfH\n",
       "AM/X6tf7b+XPlfVPA7+P4BiJDaFISm1iBbgIOBX4ObAkgsfaDVJNUmZm1lP9A/zcuXM7Ol5f16R+\n",
       "AcyU2ERiS+BQioRxr8ThABKSeGM/xwvgBmCaxDYpyRwJNIDrgQMlJqR+x7fpYyvgobT+AWDDtH4l\n",
       "xRHTpqn9OIB0BHY5cBY0r4uZmVn+ej2SimCpxIXAMuAPFAkmgPcDZ0l8juJa1feAW8tm1S5a9PmQ\n",
       "xMnAIoojn4URxSlCiY8AF0tsADwMvK1FP2cCF0l8ALgM+FPq93KJycASieeAnwCfS22+C8wCruh9\n",
       "OkaPoTwTWfY10D59NnTdMnMmTJ48tH22e46U5Xvv3Rx37Fh44YWescyc2T3GTsQzlFrN44wZvY9d\n",
       "3dbq9bg+v9YUMaLfKh8WEn8PbBnBqe3rKNaHuTAzG0qSiIj+XLYZlHX+744kfgzsDBw00rGYmdnA\n",
       "rBdHUv3hIykzs4Hr9JGU/3efmZlly0nKzMyy5SRlZmbZcpIyM7NsOUmZmVm2nKTMzCxbTlJmZpYt\n",
       "JykzM8uWk5SZmWXLScrMzLLlJGVmZtlykjIzs2w5SZmZWbacpMzMLFtOUmZmli0nKTMzy5aTlJmZ\n",
       "ZctJyszMsuUkZWZm2XKSMjOzbDlJmZlZtpykzMwsW05SZmaWLScpMzPLlpOUmZlly0nKzMyy1bEk\n",
       "JXG8xB0S5w9xvw2JKUPZp5mZ5amTR1J/CxwSwTFlgcSYIeg30jKiZs+GRqN5f/785nq1fKg1GsVS\n",
       "jlcfa6jGLvspx2vXb7VetWwgcfQ1xnAqY5g1q2dZb/UHuq2/9WbPbl2vr777M3b1OVuut3seV+ej\n",
       "XSzVx3Awr4eyfTlW+Twvl0YDJk3q3uf8+UX9ss7s2cXt1KmtY6m/bqqx1ee6HL/Va6F8/VfXyz7K\n",
       "eFrNUV+v27JdedtoFPtSLjvv3H1bqz7WNR1JUhJnA7sAl0k8LnGexNXAuRLbSvxI4oa07J/abC7x\n",
       "LYlfSdwscVgq31Ti++mo7GJg08o4R0ncKrFcYl6l/E8SX5G4TeJKiX0lFkvcLXHoUOzjwoXdnxQL\n",
       "FjTXhyNJleM5SQ2tMoZFi3qW9VZ/oNv6W2/hwtb1hiJJVZ+z5Xq753F1PtrFUn0MB/N6KNuXY5XP\n",
       "83JpNGDFiu59LlhQ1C/rLFxY3C5Z0jqW+uumGlt9rsvxW70Wytd/db3so4yn1Rz19bot25W3jUax\n",
       "L+Vy333dt7XqY13TkSQVwUeBB4Eu4DRgInBwBO8HvgacFsFbgMOB/0jNTgF+FsE+wEHAVyU2ozgi\n",
       "+1MEuwOnQnGqT+KVwDxgOjAZeLPEu1Jfm6W+3gA8BXw+9TkrrZuZ2SgwFKffeqN0e0kEz6b1Q4CJ\n",
       "0po6W0psDrwVOFTi71P5y4CdgAOA0wEiWC5xa+r3zUAjgkcBJC4ADgT+C3gugstTP8uBZyJ4UeI2\n",
       "YEK7YOfMmbNmvauri66urkHttJnZuqrRaNAYxsO2Tiep0p8r6wL2ieC5aoWUtP53BL9pUS56ql+X\n",
       "UqXs+Ur5S1CMFcFLvV0XqyYpMzPrqf4Bfu7cuR0dbyS+gn4FcHx5R2LPtHp5rXyvtPoL4H2p7A3A\n",
       "GymS0Q3ANIltJDYEjgQWdzx6MzMbNp08koo268cD35BYlsZfDHwM+AIwP53O2wC4BzgMOAs4R+IO\n",
       "YAWwBCCChyROBhZRHEUtjODSFuP1FsugzZgB1bOBM2c21zt5lrDse+zY1mMN1dhlP33116reQGPI\n",
       "6axqGcv06T3Leqs/0G39rTdjRut6/X1celN9zpbr7Z7H1floF8vavh7KesuWNe+Xz3OAyZPhoou6\n",
       "1505E8aNg2nTivsrV8Kuu8ILL3SvU4+rVcyt5nrs2GLcet3Vq5v3q+szZsADDxTxlO3q8db7qm+f\n",
       "PLn7uFdd1az3wAPNOnU5vY6GkiJG/NvcWZAUngszs4GRRES0uiQzJPwfJ8zMLFtOUmZmli0nKTMz\n",
       "y5aTlJmZZctJyszMsuUkZWZm2XKSMjOzbDlJmZlZtpykzMwsW05SZmaWLScpMzPLlpOUmZlly0nK\n",
       "zMyy5SRlZmbZcpIyM7NsOUmZmVm2nKTMzCxbTlJmZpYtJykzM8uWk5SZmWXLScrMzLLlJGVmZtly\n",
       "kjIzs2w5SZmZWbacpMzMLFtOUmZmli0nKTMzy5aTlJmZZSu7JCUxR+KEXrbvKfGOyv1DJU4anujM\n",
       "zGw4ZZekgOhj+17AO9dUDi6N4MudDQkajWIp16u3APPnN+tU6wLMnl1sb9VnvZ92YwLMmtX9frV9\n",
       "2X91nFZj1vWnTn28duWzZ7feXo273Kd6WbuYWo3Z37Leyvsao69+yvatHoPe+mv1PGqn2l9Zt3ye\n",
       "1cfsTX/2b7BtBtP3UMewro4PrV/vrQzkdTzaZJGkJE6RuEvil8DrU9kiiSlpfVuJeyU2Aj4PHCGx\n",
       "VOK9Eh+SOCPV207iRxI3pGX/VD4t1V8qcbPEFgONsa8ktWBB+yS1cGGxvVWf9X7ajQmwaFH7JFX2\n",
       "Xx2n1Zh1/alTH69d+cKFrbdX4y73qV7WLqZOJam+xuirn7J9q8dgqJJUtb+ybvk8q4/ZGyep0Tk+\n",
       "9D9JDeR1PNqMGekAUiI6AtgT2Ai4Gbgpbe52VBXB8xL/CEyJ4PjU/oOVKqcDp0VwjcROwGXA7sAJ\n",
       "wMciuE5iM+DZTu6TmZkNjRFPUsABwMURPAM8I3FJH/WVllYOASaquXVLic2Ba4DTJC5IYz3QqvGc\n",
       "OXPWrHd1ddHV1dXPXTAzWz80Gg0aw3iYmUOSClonnReADdP6Jv3sS8A+ETxXK/+yxELgfwHXSLwt\n",
       "grvqjatJyszMeqp/gJ87d25Hx8vhmtQvgJkSm0hsCRyayldBcU0KOLxS/0lgy8r9aoK7AorTgAAS\n",
       "k9PtayK4PYKvADeSrnuZmVneRvxIKoKlEhcCy4A/ADdQHF39C/ADiY8AP6F5fWoRcLLEUuCfU3m5\n",
       "7XjgGxLLKPZtMfAx4JMS04GXgNuA/x5onNUzf+V6tWzmTJg8uXXbGTNg113b99nurGK9fPr09nGM\n",
       "HduMoxpTX/pTp1089fIZM1pvr8Zd3i5b1n7/qzG1GrO/Zb2V9zVGX/2U7Vs9Br311+rxa6c+RllW\n",
       "Ps+qY/ZmMGet+9umk2fER/ps+0iPD/1/vgzkdTzaKKKvb3yvHySF58LMbGAkERHtview1nI43Wdm\n",
       "ZtaSk5SZmWXLScrMzLLlJGVmZtlykjIzs2w5SZmZWbacpMzMLFtOUmZmli0nKTMzy5aTlJmZZctJ\n",
       "yszMsuUkZWZm2XKSMjOzbDlJmZlZtpykzMwsW05SZmaWLScpMzPLlpOUmZlly0nKzMyy5SRlZmbZ\n",
       "cpIyM7NsOUmZmVm2nKTMzCxbTlJmZpYtJykzM8uWk5SZmWXLScrMzLLlJGVmZtnqaJKSmCnxksTr\n",
       "O9T/FInTO9G3mZmNPEVE5zoXFwKbAjdHMGeI+x4TwQtD15+ik3NhZrYukkREqFP9d+xISmILYB9g\n",
       "NnBEKuuSWCyxQOJuiXkSx0jcIHGrxC6p3nYSP0rlN0jsn8rnSJwvcTVwnsQ0iUvL8STOSf0sk5iV\n",
       "ys+UuFHiNql/iXL+fGg0ivVGo7kMh+q4neh3XTDc+zKa5m7+/GIZaJvZs4ulbFv2Uy+DnvPRaBT1\n",
       "6mXl66i3+WvX16xZvcdcb1fd73osreKfPbvZRxnn9tsXt7Nmwc4794y92r7cv/J+2Ud1n6vL/Pkw\n",
       "aVKxPmlSMcbUqc1t5VxPmlTcVrdNmlTcnz+/aFc+VvX+R9PzdCDGdLDvdwGXRfBbiUck3pTK3wjs\n",
       "BjwG3At8M4K3SBwPfAL4NHA6cFoE10jsBFwG7J7a7wZMjeBZia7KeP8IPBbBGwEkxqbyUyJ4TGJD\n",
       "4CqJSREs7y3wBQvg8cehq6v7A9/V1abBEGo0muMO5XhD3d9IGu59GU1zt2BBcfupTw2szapVxfqE\n",
       "CUXbsp9Vq7qXfepTPeej0YCFC+HrX+9e1mgUryNoP3/t+irbtVNvV93veixl3NX4Fy6Ebbct+ihf\n",
       "7w8/XGxbtAiefLL52i/HqbYvYyjvl/ta3ed6vCtWNG9/9zt45pnuiXDVKrj/fnjqKXjooea2FStg\n",
       "zJhiueUWGJve2bbdtnv/5XvWuqaTSeoo4LS0/sN0fyFwYwQPA0isBC5PdW4Dpqf1Q4CJah5Abimx\n",
       "ORDAJRE822K8g0lHbAARlE+VIyT+hmJfX0GR7HpNUmZmloeOJCmJ8RQJ5w0SAWxIkWB+At0SzEuV\n",
       "+y9V4hGwTwTP1foF+HNvQ9fq7wycAOwdwRMS5wCbtGs8Z84coPhEs2pVF3Q7UDMzs0ajQWMYzy12\n",
       "6kjqcOC8CP62LJBoAAf2s/0VwPHAv6S2e0awrI82VwIfpzhdWJ7u2wp4GnhS4uXAO4BF7Took1Sj\n",
       "UZziMDOz7rq6uuiqnFecO3duR8fr1BcnjgR+XCu7KJW3+wpdVLYdD+ydvgBxO3BcrV6rNv8PGCex\n",
       "XOIWoCsltqXAncAFwNWD3B8zMxsBHTmSiuCgFmVnAGfUyqZX1hcDi9P6oxQJrd7H3Nr9apungQ+1\n",
       "aHPsQOOfORMmTy7Wh/tCZDneUI+7Ll1QHanHZDSYOXNwbVauLNZ33bV7PytX9iyrz0dXF6xe3bNs\n",
       "7Njm66iddn098MDA2lX3e8aM1tuq8a9e3eyjfL2ffXZRtmxZ8QWFdmOU5eUXGKr72m6fx46FRx8t\n",
       "6l50UTGnjzzSbAvFXC9eDNOmNccv2229dTH+uHGwww4956A/cz1adfTvpEYT/52UmdnAjdq/kzIz\n",
       "M1tbTlJmZpYtJykzM8uWk5SZmWXLScrMzLLlJGVmZtlykjIzs2w5SZmZWbacpMzMLFtOUmZmli0n\n",
       "KTMzy5aTlJmZZctJyszMsuUkZWZm2XKSMjOzbDlJmZlZtpykzMwsW05SZmaWLScpMzPLlpOUmZll\n",
       "y0nKzMyy5SRlZmbZcpIyM7NsOUmZmVm2nKTMzCxbTlJmZpYtJykzM8uWk5SZmWXLSWqUaTQaIx1C\n",
       "n0ZDjOA4h5rjHFqjJc5Oc5IaZUbDE3c0xAiOc6g5zqE1WuLsNCcpMzPLlpOUmZllSxEx0jFkQZIn\n",
       "wsxsECJCnerbScrMzLLl031mZpYtJykzM8vWep+kJL1d0p2SfiPppGEYb0dJiyTdLuk2Scen8vGS\n",
       "rpT0a0lXSBpbafOZFN+dkt5aKZ8iaXnadnql/GWSLkzl10t69VrEu6GkpZIuzTVOSWMl/UjSCkl3\n",
       "SNon0zg/kx735ZK+m/od8TglfUvSw5KWV8qGJS5JH0xj/FrSBwYR51fT475M0sWSth7JOFvFWNl2\n",
       "gqSXJI3PcS5T+SfSfN4m6csjHScAEbHeLsCGwEpgArARcAswscNjbg9MTutbAHcBE4GvACem8pOA\n",
       "eWl99xTXRinOlTSvJd4AvCWt/xR4e1r/GHBmWj8C+P5axPt3wAXAJel+dnEC5wIfTutjgK1zizON\n",
       "dQ/wsnT/QuCDOcQJHADsBSyvlHU8LmA8cDcwNi13A2MHGOdfAxuk9XkjHWerGFP5jsBlwL3A+Ezn\n",
       "cjpwJbBRur/dSMcZEet9ktoPuKxy/2Tg5GGOYQFwCHAn8PJUtj1wZ1r/DHBSpf5lwL7AK4AVlfIj\n",
       "gbMrdfZJ62OARwYZ26uAq9KT99JUllWcFAnpnhblucU5nuIDybjUx6UUb7BZxEnx5lN9w+p4XMBR\n",
       "wFmVNmcDRw4kztq2WcB3RjrOVjECPwTeSPckldVcAj8ADmpRb0TjXN9P9+0A/K5y//5UNiwkTaD4\n",
       "NPMrijeEh9Omh4GXp/VXprhKZYz18gdoxr5mvyLiBeCJ6imGATgN+AfgpUpZbnHuDDwi6RxJN0v6\n",
       "pqTNc4szIv4I/CvwW+BB4PGIuDK3OCs6Hdc2vfQ1WB+m+DSfVZyS3gXcHxG31jZlE2PyWuDAdHqu\n",
       "IWnvHOJc35NUjNTAkrYALgI+GRFPVbdF8RFjxGIDkDQD+ENELAVa/g1EDnFSfEp7E8WphTcBT1Mc\n",
       "Ea+RQ5ySXgN8iuLT6yuBLSQdXa2TQ5yt5BpXlaRTgOci4rsjHUuVpM2AzwKnVotHKJy+jAHGRcS+\n",
       "FB9OfzBZmLzQAAAFFElEQVTC8QBOUg9QnCsu7Uj3LN8RkjaiSFDnR8SCVPywpO3T9lcAf2gT46tS\n",
       "jA+k9Xp52Wan1NcYYOv0SX4g9gcOk3Qv8D3gIEnnZxjn/RSfUm9M939EkbQeyizOvYFrI+LR9Mny\n",
       "YorTzbnFWer04/xoi74G9fqT9CHgncD7K8W5xPkaig8my9Jr6VXATZJenlGMpfspnpek19NLkrYd\n",
       "8Th7Oxe4ri8UnxzupngSbczwfHFCwHnAabXyr5DO+1IcCdQvAG9McWrrbpoXLX8F7JP6rF+0PCua\n",
       "54kH/cWJ1Mc0mteksosT+AXwurQ+J8WYVZzAnsBtwKap/3OBj+cSJz2vT3Q8LorrdPdQXEAfV64P\n",
       "MM63A7cD29bqjVic9Rhr26rXpHKby+OAuWn9dcBvs4hzsG9c68oCvIPigvZK4DPDMN5Uims8twBL\n",
       "0/L29OBdBfwauKL6wFGcLlhJcTH7bZXyKcDytO1rlfKXURyq/wa4HpiwljFPo/ntvuzipEgANwLL\n",
       "KD4Jbp1pnCdSvKEup0hSG+UQJ8WR8oPAcxTXEY4drrjSWL9JywcHGOeHU7v7aL6WzhzJOCsxPlvO\n",
       "ZW37PaQklclcrokzPR/PT+PeBHSNdJwR4X+LZGZm+Vrfr0mZmVnGnKTMzCxbTlJmZpYtJykzM8uW\n",
       "k5SZmWXLScrMzLLlJGU2AJJOk/TJyv3LJX2zcv9fJX16kH13Kf0kSottUyX9Kv2MwgpJf1PZtl3a\n",
       "dlOq9x4VP1nys0HE8NnBxG7WKU5SZgNzNcW/jELSBsA2FH+RX9oPuKY/HaX2/am3PcXPpRwXERMp\n",
       "/iD8OEnvTFUOBm6NiCkRcTXwf4D/GxEH96f/ms8Moo1ZxzhJmQ3MdRSJCGAPin919JSKH158GcVv\n",
       "g90s6eD0X9lvlfSfkjYGkLRK0jxJNwHvUfGjmyvS/Vltxvw4cE5E3AIQxf9AOxE4WdKewJeBd6n4\n",
       "ccp/Av4K+Jakr0jaQ9INaduy9I9ukXR0OvpaKulsSRtImgdsmsrO78DcmQ3YmJEOwGw0iYgHJb0g\n",
       "aUeKZHUdxU8N7Ac8CdxK8WOa51D8Ns9KSecCfwucTvHfxFdHxBRJm1D826HpEXG3pAtp/d/Gdwe+\n",
       "XSu7CdgjIpalxDQlIspfeZ4OnBARN0v6GjA/Ir6b/tHnGEkTgfcC+0fEi5LOBN4fESdL+nhE7DVU\n",
       "82W2tnwkZTZw11Kc8tufIkldl9bLU32vB+6NiJWp/rnAgZX2F6bb3VK9u9P979D+Zxx6+3kH9bL9\n",
       "OuCzkk6k+P9pz1CcHpwCLJG0FDiI4h+HmmXHScps4K6hOKU2ieKfa15PM2ld26K+6H6E9HSbftsl\n",
       "mjsokkrVFIpTjb2KiO8BhwJ/AX6ajrIAzo2IvdKyW0R8vq++zEaCk5TZwF0LzAAejcJjFD89sF/a\n",
       "9mtgQnn9BzgGWNyinztTvV3S/aPajPcN4EPp+hPpF07nUfycRq8k7RwR90bEGcB/USTWnwGHS9ou\n",
       "1RkvaafU5Pl0WtAsC05SZgN3G8W3+q6vlN1K8ZPwf0yn1I4FfijpVuAF4OxUb80RVar3EeAn6YsT\n",
       "D9PimlREPAQcDXxT0gqKI7n/jIifVPps93MG75V0WzqttwdwXkSsAD4HXCFpGcVPcWyf6v87cKu/\n",
       "OGG58E91mJlZtnwkZWZm2XKSMjOzbDlJmZlZtpykzMwsW05SZmaWLScpMzPLlpOUmZlly0nKzMyy\n",
       "9T9kxgDDol5HqgAAAABJRU5ErkJggg==\n"
      ],
      "text/plain": [
       "<matplotlib.figure.Figure at 0x145249b0>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# plot where each word occurs across the text (lexical dispersion plot)\n",
    "text4.dispersion_plot([\"citizens\", \"democracy\", \"freedom\", \"duties\", \"America\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "44764"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# count of all tokens (including punctuation)\n",
    "len(text3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2789"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# number of distinct tokens\n",
    "len(set(text3))"
   ]
  },
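  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two counts above (44764 tokens, 2789 of them distinct) can be combined into a simple measure of vocabulary richness: distinct tokens divided by total tokens. A minimal sketch, assuming a helper named `lexical_diversity` (a hypothetical name, not something this notebook imports):\n",
    "\n",
    "```python\n",
    "from __future__ import division  # true division on Python 2 as well\n",
    "\n",
    "def lexical_diversity(tokens):\n",
    "    # hypothetical helper: share of tokens that are distinct\n",
    "    return len(set(tokens)) / len(tokens)\n",
    "\n",
    "# plugging in the counts reported above for text3\n",
    "ratio = 2789 / 44764\n",
    "print(round(ratio, 4))\n",
    "```\n",
    "\n",
    "With the texts loaded, `lexical_diversity(text3)` should give the same ratio, since a text supports both `len()` and `set()`."
   ]
  },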
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'among',\n",
       " u'the',\n",
       " u'merits',\n",
       " u'and',\n",
       " u'the',\n",
       " u'happiness',\n",
       " u'of',\n",
       " u'Elinor',\n",
       " u'and',\n",
       " u'Marianne',\n",
       " u',',\n",
       " u'let',\n",
       " u'it',\n",
       " u'not',\n",
       " u'be',\n",
       " u'ranked',\n",
       " u'as',\n",
       " u'the',\n",
       " u'least',\n",
       " u'considerable',\n",
       " u',',\n",
       " u'that',\n",
       " u'though',\n",
       " u'sisters',\n",
       " u',',\n",
       " u'and',\n",
       " u'living',\n",
       " u'almost',\n",
       " u'within',\n",
       " u'sight',\n",
       " u'of',\n",
       " u'each',\n",
       " u'other',\n",
       " u',',\n",
       " u'they',\n",
       " u'could',\n",
       " u'live',\n",
       " u'without',\n",
       " u'disagreement',\n",
       " u'between',\n",
       " u'themselves',\n",
       " u',',\n",
       " u'or',\n",
       " u'producing',\n",
       " u'coolness',\n",
       " u'between',\n",
       " u'their',\n",
       " u'husbands',\n",
       " u'.',\n",
       " u'THE',\n",
       " u'END']"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# a Text behaves like a list of token strings, so it can be sliced\n",
    "text2[141525:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "FreqDist({u',': 18713, u'the': 13721, u'.': 6862, u'of': 6536, u'and': 6024, u'a': 4569, u'to': 4542, u';': 4072, u'in': 3916, u'that': 2982, ...})"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# build a frequency distribution of the tokens in text1\n",
    "fdist1 = FreqDist(text1)\n",
    "fdist1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(u',', 18713),\n",
       " (u'the', 13721),\n",
       " (u'.', 6862),\n",
       " (u'of', 6536),\n",
       " (u'and', 6024),\n",
       " (u'a', 4569),\n",
       " (u'to', 4542),\n",
       " (u';', 4072),\n",
       " (u'in', 3916),\n",
       " (u'that', 2982),\n",
       " (u\"'\", 2684),\n",
       " (u'-', 2552),\n",
       " (u'his', 2459),\n",
       " (u'it', 2209),\n",
       " (u'I', 2124),\n",
       " (u's', 1739),\n",
       " (u'is', 1695),\n",
       " (u'he', 1661),\n",
       " (u'with', 1659),\n",
       " (u'was', 1632)]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# the 20 most frequent tokens with their counts\n",
    "fdist1.most_common(20)"
   ]
  },
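  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In NLTK 3, `FreqDist` is a subclass of `collections.Counter`, which is where `most_common` comes from. The same counting can be sketched with the standard library alone; the token list below is made up for illustration.\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "\n",
    "# made-up tokens standing in for a real text\n",
    "tokens = 'the whale and the sea and the ship'.split()\n",
    "counts = Counter(tokens)\n",
    "\n",
    "print(counts.most_common(2))  # the two most frequent tokens\n",
    "print(counts['whale'])        # count for a single token\n",
    "```\n",
    "\n",
    "`FreqDist` adds NLP-oriented conveniences on top of this, so the sketch only covers the counting behaviour used in this notebook."
   ]
  },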
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "906"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# number of times a given token occurs\n",
    "fdist1['whale']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "image/png": [
       "iVBORw0KGgoAAAANSUhEUgAAAZQAAAEZCAYAAACw69OmAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\n",
       "AAALEgAACxIB0t1+/AAAIABJREFUeJztnXuc1FX9/58vQQEVXVEDvJviBUVBECxXRSvDvJcpWmpK\n",
       "3tAfWmYudlGzzEtlakHfvCRewzTvhpiK14BSNvGCookJChmyAtqqwPv3xznjDrADOzuf2Tm7834+\n",
       "HvPYz+fM57zm9ZnZnfee8z4XmRmO4ziOUyprVNqA4ziO0zHwgOI4juNkggcUx3EcJxM8oDiO4ziZ\n",
       "4AHFcRzHyQQPKI7jOE4mlC2gSLpe0jxJ0/PKekh6WNKrkiZKqsl7brSkmZJmSNo/r3ygpOnxuSvz\n",
       "yrtIGh/LJ0vaMu+54+NrvCrpuHLdo+M4jtNEOVsofwCGrVBWBzxsZtsBj8RzJPUFjgL6xjpjJCnW\n",
       "GQuMMLM+QB9JOc0RwPxYfgVwadTqAfwYGBwf5+cHLsdxHKc8lC2gmNmTwIIVig8BxsXjccBh8fhQ\n",
       "4DYz+8TMZgGvAUMk9Qa6m9nUeN2NeXXyte4EvhCPvwxMNLMGM2sAHmblwOY4juNkTFvnUHqa2bx4\n",
       "PA/oGY83AWbnXTcb2LSZ8jmxnPjzLQAzWwK8L2nDVWg5juM4ZaRiSXkLa774ui+O4zgdhM5t/Hrz\n",
       "JPUys7mxO+s/sXwOsHnedZsRWhZz4vGK5bk6WwBvS+oMrG9m8yXNAYbm1dkceLQ5M9tuu60tXryY\n",
       "efNCo2mbbbahe/fu1NfXA9C/f38AP/dzP/fzqj/v2TN0KM2bNw8zy+W4l8fMyvYAtgKm551fBpwb\n",
       "j+uAS+JxX6AeWAvYGngdUHxuCjAEEPAgMCyWjwTGxuPhwB/jcQ/gX0ANsEHuuIA/K4Xzzz+/pPod\n",
       "SSMFD1lopOAhFY0UPKSikYKHVDTi92az3/lla6FIug3YB9hI0luEkVeXALdLGgHMAo6M3+ovSbod\n",
       "eAlYAoyMxnOB4wagG/CgmU2I5dcBN0maCcyPQQUze0/SRcDf43UXWkjOr0Qu4raWxsbGkup3JI0U\n",
       "PGShkYKHVDRS8JCKRgoeUtIoRNkCipkdXeCpLxa4/mLg4mbKnwX6NVP+ETEgNfPcHwjDlh3HcZw2\n",
       "oqpnyudyJ61l2LDSRyN3FI0UPGShkYKHVDRS8JCKRgoeUtIohJp6lqoPSVbN9+84jlMskgom5au6\n",
       "hZIbxdBaGhqaTc1UpUYKHrLQSMFDKhopeEhFIwUPKWkUoqoDiuM4jpMd3uVVxffvOI5TLN7l5TiO\n",
       "45Sdqg4onkPJTiMFD1lopOAhFY0UPKSikYKHlDQKUdUBxXEcx8kOz6FU8f07juMUi+dQHMdxnLJT\n",
       "1QHFcyjZaaTgIQuNFDykopGCh1Q0UvCQkkYhqjqgOI7jONnhOZQqvn/HcZxiWVUOpa032HIcx3Ha\n",
       "GDN4/XV44gnYZhvYZ5/yvE5Vd3l5DiU7jRQ8ZKGRgodUNFLwkIpGCh6K0Vi2DKZPhzFjYPhw2HRT\n",
       "6NMHRoyABx4oXw7FWyiO4zjtnCVLYNq00AJ54gl48klYsGD5azbaCPbeGwYOLJ+PiuRQJJ0JfJuw\n",
       "re81ZnalpB7AeGBL4m6OuZ0WJY0GTgSWAqPMbGIsH0jYzbErYTfHM2N5F+BGYDfCbo5Hmdmbzfjw\n",
       "HIrjOO2Oxkb4+9+bAsgzz8Dixctfs9lmoWtr773DY/vtQc3vBF8Uq8qhtHlAkbQzcBuwO/AJMAE4\n",
       "FTgF+K+ZXSbpXGADM6uT1Be4NV6/KfBXoI+ZmaSpwBlmNlXSg8BVZjZB0khgZzMbKeko4HAzG96M\n",
       "Fw8ojuMkz+LFIWjkAsiUKfDxx8tf06dPU/DYe2/YcstsAsiKpDaxcQdgipk1mtlS4HHga8AhwLh4\n",
       "zTjgsHh8KHCbmX1iZrOA14AhknoD3c1sarzuxrw6+Vp3Al9ozojnULLTSMFDFhopeEhFIwUPqWi0\n",
       "tYeFC+HBB+Hcc2GPPaCmBr785ZD/ePLJEEz69YPTT4fx4+Htt+HVV+Haa+G442CrrQoHk3LOQ6lE\n",
       "DuUF4Gexi6sR+ArwD6CnmeX25J0H9IzHmwCT8+rPJrRUPonHOebEcuLPtwDMbImk9yX1MLP3ynA/\n",
       "juM4JbFgQch7PP54eEybFhLrOdZYAwYNgsMPhwsvhNpa6NGjcn4L0eYBxcxmSLoUmAh8ANQTciP5\n",
       "15iksvdFLVq0iLq6Orp27QrAoEGDqK2tpaamBmiK5IXOc2Utvb7Qeb5Wa+pncV5TU1PR+im9n6XW\n",
       "T+HzyL+H9v55pPB+Zv15vPsu/O1vDdTXw5131jB9Ouy6a7i+vr6Gzp1h+PAG+veHfv1q+PznYdmy\n",
       "nF7bvp/19fVMmjSJxsZGVkfFJzZK+hmhpXEmMNTM5sburMfMbAdJdQBmdkm8fgJwPvBmvGbHWH40\n",
       "sLeZnRavucDMJkvqDLxjZhs389qeQ3Ecp+zMndvU+nj8cXjppeWfX2stGDIkJNH32Qc+9zlYZ53K\n",
       "eF0dqeVQkPSZ+HML4KuEpPu9wPHxkuOBu+PxvcBwSWtJ2hroA0w1s7nAQklDJAk4Frgnr05O6wjg\n",
       "keZ8eA4lO40UPGShkYKHVDRS8JCKRrH1Z8+GW26Bk08Oo6t694ZLLmlg7NgQTLp2hX33hQsugMce\n",
       "g4aGkGy/6CL44hcLB5MU3otVUal5KHdI2pCQBxlpZu9LugS4XdII4rBhADN7SdLtwEvAknh9rlkx\n",
       "kjBsuBth2PCEWH4dcJOkmYRhwyuN8HIcx8mKWbOWb4H861/LP7/OOiEH8vWvhxbIoEHQpUtFrJaV\n",
       "ind5VRLv8nIcp1jM4LXXQosiF0D+/e/lr1lvvZA4z3Vh7bYbrLlmZfxmja/l5TiO00rM4JVXYNKk\n",
       "EDyeeCIM081ngw1gr72aAkj//tCpU0XsVpSqDihZ5FDyR3BUs0YKHrLQSMFDKhopeKiERn4AyT16\n",
       "926gvr6pfm4Zk1wA6dcvDO3NykPqGoWo6oDiOI5jBjNnhuR4LoDMnbv8NX37wpFHNgWQvn3LMwu9\n",
       "veM5lCq+f8epRnI5kEmTmoLIO+8sf81nPhNGYQ0dGh5ZrYPVEfAciuM4VUtuL5D8Lqw5c5a/ZuON\n",
       "Q+DIBZEddvAA0hqqOqB4DiU7jRQ8ZKGRgodUNFLw0FqNf/87tD4efTQ8Ntpo5RxIrvWx776w446r\n",
       "DiDt+b0oh0YhqjqgOI7TMZg3rymAPPZY6NLK57Ofha99rSmAeA6kPHgOpYrv33HaKwsWhCG8uRbI\n",
       "iy8u//x664Xgsd9+IYDsvPPqR2E5LcNzKI7jtGsWL4annmoKIM89F3IjObp1C/NA9tsvPAYMgM7+\n",
       "7dbmVPVb7jmU7DRS8JCFRgoeUtGopAczeP55uPtueO21Bv74xxqWLGl6fs01wwKKuQAyePCqlzJp\n",
       "z+9FihqFqOqA4jhOOixdCk8/HYLI3XfDG2+E8v79w94gQ4Y0dWHtuSesvXZl/Tor4zmUKr5/x6k0\n",
       "jY3w17/CXXfBvffCf//b9FzPnnDooXDQQWFW+vrrV86n04TnUBzHSYaGhrC97V13wV/+Ah980PTc\n",
       "ttuGXQkPOyxsfeuJ9PZFVX9cvh9KdhopeMhCIwUPqWhk6eHtt2Hs2LAv+sYbwze+AXfcEYLJbruF\n",
       "fUCmTw/7ol92GXz+803BpKO9Fx1BoxDeQnEcpyzMnBm6s268ESZPbirv1CnkQQ47LDy22KJyHp1s\n",
       "8RxKFd+/42TNSy+Flsedd4ZRWjm6dg2tk8MPDzmRDTesnEenNJLLoUgaDXwTWAZMB04A1gHGA1sS\n",
       "d2w0s4a8608ElgKjzGxiLB9I2LGxK2HHxjNjeRfgRmA3wo6NR5nZm210e45TNZiFrqo77giPl19u\n",
       "em799eGQQ0IQ2X//dPdId7KjzXMokrYCTgJ2M7N+QCfCFr11wMNmth1hD/i6eH1f4CigLzAMGBP3\n",
       "kAcYC4wwsz5AH0nDYvkIYH4svwK4tDkvnkPJTiMFD1lopOAhFY1C9c3g2Wdh9GjYbjvYddeQA3n5\n",
       "ZejRA048MSTd//MfuOqqBg4/vLRgkvJ7Ua0ahahEC2UhYS/5tSUtBdYG3gZGA/vEa8YBkwhB5VDg\n",
       "NjP7BJgl6TVgiKQ3ge5mNjXWuRE4DJgAHAKcH8vvBH5T7ptynI7MsmUwdWpTd9asWU3PbbwxfPWr\n",
       "cMQRYa+Q/K1uP/ywza06FaQiORRJJwO/BP4HPGRmx0paYGYbxOcFvGdmG0i6GphsZrfE564F/kLo\n",
       "FrvEzL4Uy/cCvm9mB0uaDnzZzN6Oz70GDDaz91bw4TkUxynA0qXwzDMhgNx5J8ye3fRc795NQWSv\n",
       "vapzu9tqJakciqRtgLOArYD3gT9J+mb+NWZmksr+Tb/NNttQV1dH165dARg0aBC1tbWfLkuQaxr6\n",
       "uZ9Xy7kZvPFGDTfdBM8/38B77/Hpsu9f+lID++wDQ4fW8LnPwcKFoX6nTun49/Psz+vr65k0aRKN\n",
       "jY2sFjNr0wchH3Jt3vmxwG+Bl4Fesaw3MCMe1wF1eddPAIYAvYCX88qPBsbmXbNHPO4MvNucl/79\n",
       "+1spLFiwoKT6HUkjBQ9ZaKTgoRIa//632c9/bta3r1nIkpj177/Att7a7JxzzCZPNlu6tLweUtZI\n",
       "wUMqGiFsNP/9XokcygzgR5K6AY3AF4GpwAfA8YQE+vHA3fH6e4FbJf0K2BToA0w1M5O0UNKQWP9Y\n",
       "4Kq8OscDk4EjCEl+x3HyWLQodGXddFPYQyTX+7vRRnD00eGxxx6+b4jTciqVQ/k+4Qt/GfAc8G2g\n",
       "O3A7sAUrDxs+jzBseAlwppk9FMtzw4a7EYYNj4rlXYCbgAGEYcPDzWxWMz6sEvfvOJViyZIw2fCm\n",
       "m8LSJ//7Xyjv0iUM8T32WBg2bPnEuuPks6ocik9srOL7d6oDM6ivD0Hk1lvD7oY59toLjjsuJNfL\n",
       "tKK508FYVUDxtbxKIJUx4SlopOAhC40UPGSlMXNmA5ddBrvsEtbLuuKKEEy22y7MG/nXv+CJJ+Db\n",
       "324+mKRyHylopOAhJY1C+FpejtOB+PDD0JV1ww1hKfj6+lC+4YYwfHjo0ho82PMiTnnwLq8qvn+n\n",
       "Y2AWFl/8wx9g/HhYuDCUr7UWHHxwCCIHHBDOHadUkpqH4jhONrz9dsiL3HADzJjRVD54MJxwAhx1\n",
       "FGywQcXsOVWI51BKIJX+zBQ0UvCQhUYKHlal8dFHYfmTAw+EzTeHuroQTHr2hO99D154AaZMgVNP\n",
       "BaljvxdtqZGCh5Q0CuEtFMdJHDOYNi10ad16K7wXFxBac82wn8gJJ4Shvp39r9mpMJ5DqeL7d9Lm\n",
       "3XfhlltCIMnfW6R//xBEjjkmTEJ0nLbEcyiO005YsiTss3799XD//eEcwiitb3wjBJISe2odp2x4\n",
       "DqUEUunPTEEjBQ9ZaFTKw5tvwo9/DFtuGWasz5oVFmo86KCwPMrbb8OVVxYXTNrre5GiRgoeUtIo\n",
       "hLdQHKdCfPJJaIVccw1MmNC0llafPnDyySE/0rt3ZT06TjF4DqWK79+pDG+8AddeG7q15s4NZWut\n",
       "FZY/Oflk2Htvn3jopEtJORRJ6wL/M7OlkrYHtgf+YmEHRcdxWsDHH8O994bWyMSJTeU77ggnnRTW\n",
       "09pww8r5c5wsaEkO5Qmgi6RNgYcIy8TfUE5TbYXnULLTSMFDFhpZe3jttTBXZPPN4etfD8Gka9cw\n",
       "e/3JJ+HFF+E731k5mHTE96I9a6TgISWNQrQkhyIz+1DSCGCMmV0m6Z9lc+Q47ZyPPw5LoPz+9/Do\n",
       "o03lO+8curS++U2fwe50TFabQ5E0DRgJXAGMMLMXJU03s35tYbCceA7FyZLZs+G3vw35kf/+N5R1\n",
       "6xaWQDn5ZN+syukYlDoP5SxgNHBXDCbbAI9ladBx2jN//3tYGv5Pf2qaN7LLLnDKKWHyoe8z4lQL\n",
       "Lcmh9DSzQ8zsUgAzex14qrUvKGl7SdPyHu9LGiWph6SHJb0qaaKkmrw6oyXNlDRD0v555QMlTY/P\n",
       "XZlX3kXS+Fg+WdKWzXnxHEp2Gil4yEKjpfWXLg3zQ2prw2KMt90Whv0eeSQ89VQD9fUwcmTrg0l7\n",
       "ei+qQSMFDylpFKIlAWV0C8tahJm9YmYDzGwAMBD4ELgLqAMeNrPtCHvA1wFI6gscBfQFhgFjpE87\n",
       "DsYSuuH6AH0kDYvlI4D5sfwKwj71jlMyCxfCr38N224bhvk+/TSsv35YmPFf/wq5k5128q4tpzop\n",
       "mEORdADwFcKX+R+B3J9Id6CvmQ0u+cVDa+NHZraXpBnAPmY2T1IvYJKZ7SBpNLAs10KSNAG4AHgT\n",
       "eNTMdozlw4GhZnZqvOZ8M5siqTPwjplt3Mzrew7FaRGzZsFVV4X8yKJFoWybbeDMM+Fb34Lu3Svp\n",
       "znHajtbmUN4GngUOjT9zAguB72TkbThwWzzuaWa53a7nAT3j8SbA5Lw6s4FNgU/icY45sZz48y0A\n",
       "M1sSu9V6mNl7Gfl2qgAzeOaZkB+56y5YtiyU77NPGOp70EHQqVNlPTpOShQMKGb2T+Cfkm4pxyRG\n",
       "SWsBBwPnNvPaJqnsTYe9996buro6unbtCsCgQYOora2lJnZ85/oaC53Pnj2bddddt8XXN3e+ePFi\n",
       "Nttss1bXz1FTU9Pq+vl1K1U/lfdz8eLF9Oy5GXfcAffc08Arr0B9fQ2dO8PZZzdwxBEweHD6n0cW\n",
       "72cKn0cq72cKn0el3s/6+nomTZpEY2Mjq8XMVvkAaoGHgZnAG/Hxr9XVa4HuocCEvPMZQK943BuY\n",
       "EY/rgLq86yYAQ4BewMt55UcDY/Ou2SMedwbebc5D//79rRQWLFhQUv2OpJGCh1I1Fi0yGzNmgW22\n",
       "mVlon5j16GF23nlmc+a0jYeUNFLwkIpGCh5S0Qhho/nv9ZbMQ3mFMHT4OWBpXiD67+rD1Sp1/0hY\n",
       "wmVcPL+MkEi/VFIdUGNmdTEpfyswmNCV9VdgWzMzSVOAUcBU4AHgKjObIGkk0M/MTou5lcPMbHgz\n",
       "Hmx19+90fJYtC/uOnHsuvPNOKNthBzjrrDCjfe21K+vPcVJiVTmUlgSUKWY2JGND6xCS6lub2aJY\n",
       "1gO4HdgCmAUcaWYN8bnzgBOBJcCZZvZQLB9IWAamG/CgmY2K5V2Am4ABwHxguJnNasaHB5QqZ+pU\n",
       "GDUqbJsLsPvucOGF8OUvwxpVvbmD4zRPqQHlEqAT8Gfgo1y5mT2XpclKMGDAAJs2bVqr6zc0NHza\n",
       "31jtGil4KEbjnXdg9GgYNy6c9+oFl1wCBx/cQI8e7ec+yqmRgodUNFLwkIpGqTPl9wAMGLRC+b6t\n",
       "duQ4FeKjj8I8kp/+FBYvDsvGf+c78IMfhKG/ZZzz5TgdHt8PpYrvv5owg/vug+9+F15/PZQdcgj8\n",
       "8pdhkqLjOC2j1P1Qzie0UBR/AmBmP8nMoeOUkZdeCq2Q3D4kO+4YWin777/qeo7jFEdL0o4fxMdi\n",
       "YBlh9vxWZfTUZvhaXtlppOBhRY0FC8JIrV12CcFk/fVDIPnnPwsHkxTvo1IaKXhIRSMFDylpFGK1\n",
       "LRQz+0X+uaTLgYkFLnecirN0aVgi5Yc/DMvIr7EGnHoq/OQnsPFKC/A4jpMVRedQ4vDeqWbW7nue\n",
       "PYfS8Xj88bC+1j/jFnB77w1XXgklNkYdx4mUmkOZnne6BvAZwPMnTlK89RacfXbYkwRgiy3gF78I\n",
       "KwL7yr+O0za0JIdycHwcBOwPbGJmV5fVVRvhOZTsNCrl4eOP4dJLw8z2P/0JBg9u4MIL4eWXwx7u\n",
       "xQaTFN7LVDRS8JCKRgoeUtIoREtyKLMk9Qf2IozyehLwPeWdivPoo3DGGSF4AHzta/Czn8H221fW\n",
       "l+NUKy2ZKX8mcBJhpryAw4BrzOyq8tsrL55DaZ/MmRM2tPrjH8N5nz5w9dVhuRTHccpLqUuvTCes\n",
       "3PtBPF8HmGxm/TJ32sZ4QGlffPJJCBznnx9muXfrFma4f+970KVLpd05TnWwqoDS0uXvlhU4btd4\n",
       "DiU7jXJ7eOIJ2G23kHhfvBgOPTRMWPzBD5YPJqnfR3vSSMFDKhopeEhJoxAtWcvrD8AUSfldXteX\n",
       "zZHj5DF3LpxzDtx8czj/7GfDVrwHHlhZX47jrEyL5qHEZeJriUl5M2v9Er0J4V1e6bJkCYwZAz/6\n",
       "ESxcGFoho0fD978furocx6kMrcqhSBoMbGRmD65Q/hVgnpk9m7nTNsYDSpo88wyMHNk0OfErXwmt\n",
       "km22qawvx3Fan0O5FHipmfKXgF80U16MoRpJd0h6WdJLkoZI6iHpYUmvSpooqSbv+tGSZkqaIWn/\n",
       "vPKBkqbH567MK+8iaXwsnyxpy+Z8eA4lO40sPLzxRgMnngh77hmCyZZbwt13w/33tzyYpHAfHUUj\n",
       "BQ+paKTgISWNQqwqoHRvbpfDWLZRia97JWGHxR2BXQj7ydcBD5vZdsAj8Zy4BfBRQF9gGDBG+nS6\n",
       "2lhghJn1AfpIGhbLRxC2E+4DXEEIjk6iLFsGv/sdHHcc/OEPYY+SH/4wJN0PPdRnujtOe2FVXV6v\n",
       "FVqva1XPrfYFpfWBaWb22RXKZwD7mNk8Sb2ASWa2g6TRwDIzuzReNwG4gLCF8KMxKBH3jh9qZqfG\n",
       "a843symSOgPvmNlKywJ6l1flmTULRowIkxQhrAJ89dWw3XYVteU4TgFa2+X1iKSf5bUGkLSGpIuA\n",
       "R0vwszXwrqQ/SHpO0jVxbktPM5sXr5kH9IzHmwCz8+rPBjZtpnxOLCf+fAvAzJYA78dFLZ1EMINr\n",
       "roF+/UIw2WgjGD8eJkzwYOI47ZVVBZSzgW2A1yX9OQ4bnglsF59rLZ2B3YAxZrYbYa+VuvwLYrOh\n",
       "7E0Hz6Fkp1FM/bfegmHD4OSTw5ySr34VXnwR9t+/oeTuLe8rz04jBQ+paKTgISWNQhSch2Jmi4Hh\n",
       "krYBdiJ8wb9kZq+X+Jqzgdlm9vd4fgcwGpgrqZeZzZXUG/hPfH4OsHle/c2ixpx4vGJ5rs4WwNux\n",
       "y2t9M3tvRSPrrbcedXV1dO3aFYBBgwZRW1tLTU0YD5B74wudL168eJXPt+R88eLFJdXPp7X12+p8\n",
       "wYIGJkyAU0+tYeFCqK1t4Kyz4KtfrUGC2bMr/352pM+j1N9P//1O6/Oo1PtZX1/PpEmTaGxsZHVU\n",
       "ZE95SU8A3zazVyVdAKwdn5pvZpdKqgNqzKwuJuVvBQYTurL+CmxrZiZpCjAKmAo8AFxlZhMkjQT6\n",
       "mdlpMbdymJkNb8aH51DaiLffDi2SBx4I5wcfDP/3f9C7d2V9OY5THCWt5VUOJO0KXAusBbwOnAB0\n",
       "Am4ntCxmAUeaWUO8/jzgRGAJcKaZPRTLBwI3AN0Io8ZGxfIuwE3AAGA+MLy5EWseUMqPGdxyC/y/\n",
       "/wcNDWEb3quvhm9+00dvOU57JLmAkgoDBgywadNaP+m/oaHh0+ZhtWs0V3/ePDjlFLjnnnB+wAEh\n",
       "Eb/pps0IZOAhC40UPKSikYKHVDRS8JCKRsmLQ0raS9IJ8XhjSVu32o3T4TELI7Z22ikEk+7d4brr\n",
       "QndXoWDiOE77pyXL118ADAS2N7PtJG0K3G5me7aBv7LiXV7Z8+67YdmUO+4I51/6Elx7bdiS13Gc\n",
       "9k+pLZTDgUMJw3sxszlA9+zsOR2FO+8MrZI77oB11gmz3x96yIOJ41QLLQkoH5nZp3ugxEmIHQKf\n",
       "h5KNxvz5cPbZDRxxRGihDB0K06eH/EkxifdK30cqHlLRSMFDKhopeEhJoxAtCSh/kvR/QI2kkwnr\n",
       "bF1bNkdOu+K++2DnncNs97XXDiO4HnkEtvYsm+NUHS3dD2V/ILfK70Nm9nBZXbURnkNpPe+/D2ed\n",
       "BTfcEM5ra8PCjtu2aoU3x3HaC6XuKX828MeYO+lQeEBpHRMnhgUdZ88OG19dfDGceSZ06lRpZ47j\n",
       "lJtSk/LdgYmSnpJ0hqSeq63RTvAcSnEaixbBqafCl78cgsngwVBfD9/9Lixa1H7uI3UPqWik4CEV\n",
       "jRQ8pKRRiNUGFDO7wMx2Ak4HegNPSHqkbI6cJHn8cdh117Bcypprws9+Bk8/DTvsUGlnjuOkQotn\n",
       "yscFG48AjgbWNbNdymmsLfAur9Xz4Ydw3nlwZdwPs39/GDcOdmn3n77jOK2hpC4vSSMlTSKM7tqI\n",
       "sKijf51UAZMnw4ABIZh06gQ/+hFMmeLBxHGc5mlJDmUL4Cwz62tm55tZc/vMt0s8h9K8xkcfQV1d\n",
       "2Nv91Vehb98QXH7yk7A9b1t4qJRGCh5S0UjBQyoaKXhISaMQBfdDkbSemS0ELgdsxR0Pm9tfxGn/\n",
       "PPssHH982PBKgu9/Hy68EOKWMY7jOAVZ1Z7yD5jZgZJm0czuiWbW7qeueQ6liU8+CYn2n/4Uli4N\n",
       "80nGjYPPf77SzhzHSQlfvr4AHlAC06eHVkluJf9Ro+DnPw8z3x3HcfIpNSm/0hDhjjJs2HMoYab7\n",
       "oEFg1sCWW4YlVK68svhgUun7yEojBQ+paKTgIRWNFDykpFGIggFFUjdJGwIbS+qR99iKsBVvq5E0\n",
       "S9LzkqZJmhrLekh6WNKrkiZKqsm7frSkmZJmxGVgcuUDJU2Pz12ZV95F0vhYPlnSlqX47YiYhe6t\n",
       "E06Ajz+GAw8MLZV99620M8dx2iuryqGcBZwJbAK8nffUIuD3ZvabVr+o9AYwMD+xL+ky4L9mdpmk\n",
       "c4ENVthTfnea9pTvE/eUnwqcYWZTJT3I8nvK72xmIyUdBRzue8o3sXQpnHFGWF5eCgs6nn56pV05\n",
       "jtMeKHUtr1FmdlXGht4ABpnZ/LyyGcA+ZjZPUi9gkpntIGk0sMzMLo3XTQAuAN4EHjWzHWP5cGCo\n",
       "mZ0arznfzKZI6gy8Y2YbN+Oj6gLKhx/CMceEnRS7dIFbb4WvfrXSrhzHaS+UlEMxs6sk7SzpSEnH\n",
       "5R4lejLgr5L+IemkWNbTzObF43lAbs2wTYDZeXVnE1oqK5bPoakrblPgreh/CfD+isOeofpyKPPn\n",
       "wxe/GIJJTQ389a9NwcT7mNPxkIpGCh5S0UjBQ0oahSg4DyVH3AJ4H2An4AHgAOAp4MYSXndPM3tH\n",
       "0sbAw7FoMv9WAAAfOElEQVR18imxO6u6mg5lZtYsGDYMXnkFNt8cJkwIExYdx3GyYrUBhbB+167A\n",
       "c2Z2Qlxt+JZSXtTM3ok/35V0FzAYmCepl5nNjeuG/SdePgfYPK/6ZoSWyZx4vGJ5rs4WwNuxy2v9\n",
       "5iZiLlq0iLq6OrrGWXuDBg2itraWmpowHiAXyQud58paen2h83yt1tRf3fmsWTUccAD06tXA4YfD\n",
       "1VfXsOmmy19fU1NT0uuVWj+l97PU+lmcp/B+llq/I72fKXwelXo/6+vrmTRpEo2NjayOluRQ/m5m\n",
       "u0t6FtgPWAjMMLPtV6vevN7aQCczWxS3E54IXAh8EZhvZpdKqgNqVkjKD6YpKb9tbMVMAUYBUwmt\n",
       "p/ykfD8zOy3mVg6r1qT8I4/A4YeHpeeHDoW774b116+0K8dx2iul7ofyd0kbANcA/wCmAc+U4Kcn\n",
       "8KSkemAKcL+ZTQQuAb4k6VVC4LoEIK4ddjvwEvAXYGReFBhJ2I54JvCamU2I5dcBG0qaCZwF1DVn\n",
       "pKPnUG69FQ44IASTo44K3VyFgon3MafjIRWNFDykopGCh5Q0CrHaLi8zGxkPfyfpIWA9M/tna1/Q\n",
       "zN4AVvomj11SXyxQ52Lg4mbKnwX6NVP+EXBkaz22d8zgl7+Ec84J59/5DvziF7BGS/59cBzHaSWr\n",
       "mocykGbW8MphZs+Vy1Rb0RG7vJYtg7PPhl//Opz/8pdhR0XHcZwsaNU8lLgHyqoCSrufU93RAkpj\n",
       "Y1iT6/bbw66KN94Iw1fKHDmO47SeVuVQzGyome1b6FE+u21HR8qhNDSEYcG33w7rrRfyJcUEE+9j\n",
       "TsdDKhopeEhFIwUPKWkUoiXzUI6n+eXrS5mH4mTIu++GCYovvAC9e8Nf/hL2f3ccx2lLWjJs+Dc0\n",
       "BZRuhBFYz5nZEWX2VnY6QpfXiy+Glsns2bDDDqFlsqUvhek4TpnIdD+UuArweDP7chbmKkl7DyhT\n",
       "poRg0tAQNsK67z7osdICM47jONlR6jyUFfkQaPe7NUL7zqE8+WRYl6uhAU4/vYG//rW0YOJ9zOl4\n",
       "SEUjBQ+paKTgISWNQrQkh3Jf3ukaQF/CREOnQjz6KBx8cFg5+Oijw57v3bpV2pXjONVOS3IoQ/NO\n",
       "lwBvmtlb5TTVVrTHLq8JE8JSKo2N8K1vwbXXQqdOlXblOE61kEkORdJ65LVomltssb3R3gLKPffA\n",
       "kUeGHRZPOQXGjPHZ747jtC2l7il/iqS5wHTg2fj4R7YWK0N7yqH86U9wxBEhmJx5Jowd2xRMUuhX\n",
       "TcFDFhopeEhFIwUPqWik4CEljUK0ZPn6cwjb6f63bC6cVXLzzWEG/LJlcO658POfh617HcdxUqIl\n",
       "OZSJhD3ZP2gbS21He+jyuu46OOmksODj+eeHhwcTx3EqRal7yu8G3AD8Dfg4FpuZjcrSZCVIPaCM\n",
       "GQOnnx6Of/5zqGt2EX7HcZy2o9R5KL8nbGo1mZA7yeVR2j0p51B+9aumYHLFFasOJin0q6bgIQuN\n",
       "FDykopGCh1Q0UvCQkkYhWpJD6WRmvgB6G3LxxfCDH4TjMWPgtNMq68dxHKcltKTL62LgTeBe4KNc\n",
       "eanDhiV1IrR4ZpvZwZJ6AOOBLYFZwJFm1hCvHQ2cCCwFRsUdHnN7ttwAdAUeNLMzY3kX4EZgN2A+\n",
       "cJSZvdmMh6S6vHJ5kosuCnmS666DE06otCvHcZwmSu3yOoawhe4zNHV3ZdHldSZhW9/cN3od8LCZ\n",
       "bQc8Es+Je8ofRZihPwwYI32alh4LjDCzPkAfScNi+QjC/vR9gCuASzPwW1bMwgiuiy4KExVvvtmD\n",
       "ieM47YvVBhQz28rMtl7xUcqLStoM+AphP/hccDgEGBePxwGHxeNDgdvM7BMzmwW8BgyR1BvobmZT\n",
       "43U35tXJ17oT+EJzPlLJoZiFuSWXXw6dO8P48XDMMW3vo5L1U9FIwUMqGil4SEUjBQ8paRSiUvuh\n",
       "XEGY37JeXllPM5sXj+cBPePxJoQBATlmA5sCn8TjHHNiOfHnW9HnEknvS+qR4uz+Zcvg1FPh97+H\n",
       "tdaCO+4I63Q5juO0N1qSlN+dZvZDIbQIikbSQcB/zGzaCuuEfYqZmaSyJzcWLVpEXV0dXbt2BWDQ\n",
       "oEHU1tZSU1MDNEXyQue5spZev+L5/PkNXH55CCZdu8K99zaw++4ArdMr5bympqai9bN4P1f8z6tS\n",
       "9VP4PPLvob1/Him8nyl8HpV6P+vr65k0aRKNjY2sjjbfDyUm+Y8lLDTZldBK+TMhcA01s7mxO+sx\n",
       "M9tBUh2AmV0S608AzicMFHjMzHaM5UcDe5vZafGaC8xssqTOwDtmtnEzXiqWlF+yBI47Dm67DdZZ\n",
       "B+6/H4YOrYgVx3GcFpPUfihmdp6ZbR7zMMOBR83sWMIosuPjZccDd8fje4HhktaStDXQB5hqZnOB\n",
       "hZKGxCT9scA9eXVyWkcQkvwrUakcytKlYaXg226Dz32ugYceKi2YpNCvmoKHLDRS8JCKRgoeUtFI\n",
       "wUNKGoVIYT+UXBPhEuB2SSOIw4YBzOwlSbcTRoQtAUbmNStGEoYNdyMMG54Qy68DbpI0kzBseHiG\n",
       "fkti6VI48US45RZYd92QiN9zz0q7chzHKZ3W7Icyy8xmF7i8XdHWXV7LloV1ua6/PnRzTZgAtbVt\n",
       "9vKO4zgls6our4ItFEl9CCOvJq1QXiupi5m9nq3Njk1uNNf114fdFR94wIOJ4zgdi1XlUH4NLGym\n",
       "fGF8rt3TVjkUMzjjDLjmmjCa6/77YZ99itPIwkc5NVLwkIVGCh5S0UjBQyoaKXhISaMQqwooPc3s\n",
       "+RULY1lJExuridykxbFjoUsXuPde2G+/SrtyHMfJnoI5FEmvmdm2xT7Xnih3DsUMvvtd+PWvw6TF\n",
       "e+6BYcNWX89xHCdVWjts+B+STm5G7CQ6yPL15SS3Ntevfw1rrgl//rMHE8dxOjarCihnASdIelzS\n",
       "r+LjccLCi2e1jb3yUq4cillYfj63Ntcdd8CBBxankYWPttRIwUMWGil4SEUjBQ+paKTgISWNQhQc\n",
       "5RVnrH8e2BfYmTBf5H4ze7RsbjoIF1wQdljs1Cks9HjIIZV25DiOU36KXnqlI1GOHMpPfhL2NOnU\n",
       "KcyE//rXM5V3HMepKFkvveIU4OKLQzBZYw246SYPJo7jVBdVHVCyzKFcdlnIm0gwbhwcfXTxGln4\n",
       "qJRGCh6y0EjBQyoaKXhIRSMFDylpFKKqA0pW/OpXYUSXFGbCf/OblXbkOI7T9ngOpcT7v+qqMHER\n",
       "wkz4b387A2OO4ziJ4jmUMvHb3zYFk9/9zoOJ4zjVTVUHlFJyKDffDNdeG/oif/MbOOWU1umk0ifq\n",
       "fczpeEhFIwUPqWik4CEljUJUdUBpLR9+CGefHY5/+Us4/fTK+nEcx0kBz6G04v6vvBLOOgsGDYKp\n",
       "U0My3nEcpxpIKociqaukKZLqJb0k6eexvIekhyW9Kmli3Ls+V2e0pJmSZkjaP698oKTp8bkr88q7\n",
       "SBofyydL2jIr/42NcOml4fjHP/Zg4jiOk6PNA4qZNQL7mll/YBdgX0m1QB3wsJltR9gDvg5AUl/g\n",
       "KMLWw8OAMXEPeYCxwAgz6wP0kZRbfnEEMD+WXwFc2pyX1uRQrrsO3nkH+veH2to0+jNT0EjBQxYa\n",
       "KXhIRSMFD6lopOAhJY1CVCSHYmYfxsO1gE7AAuAQYFwsHwccFo8PBW4zs0/MbBbwGjBEUm+gu5lN\n",
       "jdfdmFcnX+tO4AtZ+P7oI7jkknDsrRPHcZzlqUhAkbSGpHpgHvCYmb1I2NBrXrxkHtAzHm8C5O9h\n",
       "PxvYtJnyObGc+PMtADNbArwvqceKPurr64vyfcMNMHs29OsHhx4KNTU1q62zOjqKRgoestBIwUMq\n",
       "Gil4SEUjBQ8paRSi4GrD5cTMlgH9Ja0PPCRp3xWeN0llHy2wzTbbUFdXR9euXQEYNGgQtbW1n77h\n",
       "uaZhTU0NH38Mf/5zA/37w3nn1bDGGss/v+L1fu7nfu7nHeG8vr6eSZMm0djYyGoxs4o+gB8B3wNm\n",
       "AL1iWW9gRjyuA+ryrp8ADAF6AS/nlR8NjM27Zo943Bl4t7nX7t+/v7WUa681A7O+fc2WLg1lCxYs\n",
       "aHH9QnQUjRQ8ZKGRgodUNFLwkIpGCh5S0Qhho/nv80qM8tooN4JLUjfgS8A04F7g+HjZ8cDd8fhe\n",
       "YLiktSRtDfQBpprZXGChpCExSX8scE9enZzWEYQkf6v55BP42c/C8Q9/GFYTdhzHcZanzeehSOpH\n",
       "SJivER83mdnlMcdxO7AFMAs40swaYp3zgBOBJcCZZvZQLB8I3AB0Ax40s1GxvAtwEzAAmA8Mt5DQ\n",
       "X9GLteT+x42Db30LttsOXnop7HXiOI5TjaxqHopPbFzN/S9ZAn37wsyZIbAcd1wbmXMcx0mQpCY2\n",
       "pkRL5qGMHx+CyWc/C8ccs/xzqYwJT0EjBQ9ZaKTgIRWNFDykopGCh5Q0ClHVAWV1LF0KP/1pOP7B\n",
       "D6BzRcbEOY7jtA+8y2sV9z9+PAwfDltuGVopa67ZhuYcx3ESxLu8WsGyZXDRReH4vPM8mDiO46yO\n",
       "qg4oq8qh3HUXvPgibL45HH9889ek0p+ZgkYKHrLQSMFDKhopeEhFIwUPKWkUoqoDSiGWLYOf/CQc\n",
       "19VBly6V9eM4jtMe8BxKM/d/zz1w2GGwySbw+usQV2ZxHMepejyHUgRmTbmTc8/1YOI4jtNSqjqg\n",
       "NJdD+ctf4NlnoWdPOOmkVddPpT8zBY0UPGShkYKHVDRS8JCKRgoeUtIoRFUHlBUxa8qdfP/70K1b\n",
       "Zf04juO0JzyHknf/EyfCl78MG28Mb7wB66xTQXOO4zgJ4jmUFmAGF14Yjr/3PQ8mjuM4xVLVASU/\n",
       "h/LYY/DMM9CjB5x2Wsvqp9KfmYJGCh6y0EjBQyoaKXhIRSMFDylpFKKqA0o+udzJd78L3btX1ovj\n",
       "OE57xHMoZjz+OAwdCjU1MGsWrL9+pZ05juOkiedQVkNu3slZZ3kwcRzHaS2V2AJ4c0mPSXpR0guS\n",
       "crss9pD0sKRXJU3MbRMcnxstaaakGZL2zysfKGl6fO7KvPIuksbH8smStmzOS//+/Xn6aXjkEVhv\n",
       "PRg1qrh7SaU/MwWNFDxkoZGCh1Q0UvCQikYKHlLSKEQlWiifAN8xs52APYDTJe0I1AEPm9l2hD3g\n",
       "6wAk9QWOAvoCw4AxcQ95gLHACDPrA/SRNCyWjwDmx/IrgEsLmcm1TkaNgg02yPI2HcdxqouK51Ak\n",
       "3Q38Jj72MbN5knoBk8xsB0mjgWVmdmm8fgJwAfAm8KiZ7RjLhwNDzezUeM35ZjZFUmfgHTPbuJnX\n",
       "NjDWXTfkTjbcsA1u2HEcpx2TbA5F0lbAAGAK0NPM5sWn5gE94/EmwOy8arOBTZspnxPLiT/fAjCz\n",
       "JcD7knoU8nHGGR5MHMdxSqVim9pKWhe4EzjTzBY19WKBmVloPZSXvffem9dfr2PJkq5ccAEMGjSI\n",
       "2tpaampC+ibX11jofPbs2ay77rotvr6588WLF7PZZpu1un6OmpqaVtfPr1up+qm8nx3l88ji/Uzh\n",
       "80jl/Uzh86jU+1lfX8+kSZNobGxktZhZmz+ANYGHgLPyymYAveJxb2BGPK4D6vKumwAMAXoBL+eV\n",
       "Hw2Mzbtmj3jcGXi3OR/9+/e3733PWs2CBQtaX7mDaaTgIQuNFDykopGCh1Q0UvCQikYIG81/t7d5\n",
       "DiUm1McRkubfySu/LJZdKqkOqDGzupiUvxUYTOjK+iuwrZmZpCnAKGAq8ABwlZlNkDQS6Gdmp8Xc\n",
       "ymFmNrwZLzZ3rtGz54rPOI7jOM2xqhxKJQJKLfAE8DyQe/HRhKBwO7AFMAs40swaYp3zgBOBJYQu\n",
       "sodi+UDgBqAb8KCZ5YYgdwFuIuRn5gPDzWxWM16sre/fcRynPZNUQEmJAQMG2LRp01pdv6Gh4dP+\n",
       "xmrXSMFDFhopeEhFIwUPqWik4CEVjWRHeTmO4zgdh6puoXiXl+M4TnF4C8VxHMcpO1UdUJrbU74Y\n",
       "UllXJwWNFDxkoZGCh1Q0UvCQikYKHlLSKERVBxTHcRwnOzyHUsX37ziOUyyeQ3Ecx3HKTlUHFM+h\n",
       "ZKeRgocsNFLwkIpGCh5S0UjBQ0oahajqgOI4juNkh+dQqvj+HcdxisVzKI7jOE7ZqeqA4jmU7DRS\n",
       "8JCFRgoeUtFIwUMqGil4SEmjEFUdUBzHcZzs8BxKFd+/4zhOsXgOxXEcxyk7FQkokq6XNE/S9Lyy\n",
       "HpIelvSqpImSavKeGy1ppqQZkvbPKx8oaXp87sq88i6SxsfyyZK2bM6H51Cy00jBQxYaKXhIRSMF\n",
       "D6lopOAhJY1CVKqF8gdg2ApldcDDZrYd8Eg8J24BfBTQN9YZE7cRBhgLjDCzPkAfSTnNEYTthPsA\n",
       "VwCXNmdi0aJFJd3EU089VVL9jqSRgocsNFLwkIpGCh5S0UjBQ0oahahIQDGzJ4EFKxQfQthrnvjz\n",
       "sHh8KHCbmX0St/F9DRgiqTfQ3cymxutuzKuTr3Un8IXmfLz++usl3cc//vGPkup3JI0UPGShkYKH\n",
       "VDRS8JCKRgoeUtIoREo5lJ5mNi8ezwN6xuNNgNl5180GNm2mfE4sJ/58C8DMlgDvS+pRJt+O4zgO\n",
       "aQWUT4lDr8o+/Kpnz56rv2gVNDY2luyho2ik4CELjRQ8pKKRgodUNFLwkJJGQcysIg9gK2B63vkM\n",
       "oFc87g3MiMd1QF3edROAIUAv4OW88qOBsXnX7BGPOwPvFvBg/vCHP/zhj+Iehb7XO5MO9wLHExLo\n",
       "xwN355XfKulXhK6sPsBUMzNJCyUNAaYCxwJXraA1GTiCkORfiUJjqR3HcZziqcjERkm3AfsAGxHy\n",
       "JT8G7gFuB7YAZgFHmllDvP484ERgCXCmmT0UywcCNwDdgAfNbFQs7wLcBAwA5gPDY0LfcRzHKRNV\n",
       "PVPecRzHyY6UurzanDj0+D0z+6jCPnqZ2dxKemgNceRcH6BLrszMnmhjD8u9d6l8ppWmvf5OOe2b\n",
       "JEd5tSE3A69I+kWFfVxX4dcvGkknAY8TBkBcCDwEXFABKyu+d0V/ppL2lPQNScfHx3HFGJDUtSVl\n",
       "q6j/RnxMKeZ1V8ODGWq1GElntqSsQN3FkhYVeCws0seRktaLxz+SdJek3YqoXytp3Xh8rKRfFVpx\n",
       "YxUa20t6RNKL8XwXST8sRiPWK+n3s02p1CivVB6EoLpTC6/tRfgCmxDP+xJm6reV15viz7NK0FgM\n",
       "LCrwWFiEzguE3FV9PN8BuKsVfnoBBwMHAZ+pwGd6M/AMMAa4Ovco8vWea0lZWz6Aaa2oczmwHrAm\n",
       "YSDLf4FjS33d3O9IG9//9PizFpgUf7+mFFMfELArMA04HXi8SA9PEEakTovnAl4sUiOL38+SP9eW\n",
       "Pqq6ywvAzJYBL7bw8hsIy8b8IJ7PJAwkaKsWxkBJmwAnSrqR8Av6aRLMzN5bnYCZ5f7r+inwNuEX\n",
       "FuAbhMmiLaXRzP4nCUldzWyGpO2LqI+kIwm/7I/Hot9IOsfM/lSMzooU+ZkOBPpa/Msrhti9tgmw\n",
       "dvzvN/d5rAesXaxexlzTijr7m9k5kg4nDIz5KvAkYYDLKpF0NHAMsLWk+/Ke6k4YGNPWLI0/DwKu\n",
       "MbP7JV1URP0lZmaSDgN+a2bXShpRpIe1zWxKbqWoqPdJkRqt/v3Mo9Wfa7FUfUApko3MbLykOgAz\n",
       "+0TSkjZ8/d8R/sP4LPBsM89vXYTWIWa2S975WEnPAz9qYf23JG1AGN79sKQFhF/WYvghsLuZ/QdA\n",
       "0saE+yspoBTJC4R5T2+3ou7+wLcIw9l/mVe+CDivZGclYGZjWlEt931wEHCHmb0vqaVfZM8A7wAb\n",
       "A78gBFcI78U/W+GlVOZI+j3wJeCS2AVZTBf/oji69JvAXpI6Ef7DL4Z3JW2bO5F0BOE9KoZSfj9z\n",
       "lPK5tuqFnJaxWNKGuRNJewDvt9WLm9lVwFWSxgL/B+xN+I/4STOrL1LuA0nfBG6L58MJ3WEt9XJ4\n",
       "PLxA0iTCf+UTivQg4N288/k0fRG1FRsDL0maCuQS+WZmh6yuopmNA8ZJOsLM7iinyTbiPkkzgEbg\n",
       "NEmficerxczeBN4E9iijv2I4krCY7OVm1hBbk+cUUf8oQovrRDObK2kLQqAshjMIf6fbS3ob+Bch\n",
       "QK2WvFbeurTy9zOPVn+uxeLDhosgznu5GtiJ0KWyMXCEmbXpf2AxyXkS8OdYdDihWX9V4VoraWwN\n",
       "XAl8PhY9TZjjMytDq6vzcDmhj/pWQiA5CnjezL7fhh6GNlduZpOK1DmIkFP7NBlvZj8pxVsliP8w\n",
       "NZjZUknrEBZgXe1oMUlPm9mekhaT1w0bMTNbrxx+Uya2ir5GWBWkB7CQ8F6s9vci7/fSWPmfLDOz\n",
       "xymC1n6uxeIBpUgkrQnkcgWvmFmxfaJZeJhOWFrmg3i+DjDZzPq1tZdSkHQZMIWQODXgKcJ9tVlA\n",
       "yQJJ/0cYoLAfIXfxdUICuNg+94og6Qtm9oikr9EUDHJfYmZmfy5QtcORZWCU9BDQQOiezuV0MLNf\n",
       "Fqy0ssZlK/49SLrUzM5tqUas0w/YkfB7atHHjcVotOh1PKAUh6Q9Cf9xdKaMH8xqPEwHBpvZ/+J5\n",
       "N8JyNC0OKLHZexJN9wLhD+bEjO2uysM0MxuwQtn0tgiMGX9xTDezfpKeN7Nd4nDTCWZWm6npMiHp\n",
       "QjM7X9INrPxeYGYntL2r9o+kF8xs5xI1Sv4bkXQBYWWSnYAHgAOAp8zsiFK8NYfnUIpA0s2EhHg9\n",
       "ef9xEPZiaUv+AEyR9GfCf5KHAdcXqXEPYVjjw8CyWNYm/11IOg0YCWyjvF07CSOCnm4LD2a2Z/y5\n",
       "bgZy/4s/P5S0KSEX1CsD3TbBzM6Ph6fS1EXj3w2l84ykXczs+WIrZvw3cgSha/k5MztBUk/glmI9\n",
       "tQT/pSmOLIbwlYyZ/UrS4zR1FX3LzKYVKdOt2GZzhtwK/AW4BDiXvBFBZlaJIaalcl8c8XY5TaPv\n",
       "WjNst9LcQ1MXTRnXOO/Y5AWATsAJkt5g+YT6Ls3XXI4s/0b+F3MnSyStD/wH2LxIjRbhXV5FIOlP\n",
       "hMR1KUP4kiDOQ/mbmT1QaS8diZiI7WpxYdP2RBZdNA5I2mpVz7dk4Iuk9cxsYUymN9cNudo5Z3la\n",
       "Ywhz544CzgY+IEy2zLwr0wNKC1hhCN8AwnL5rR3ClwQxd7A28DGQG1hQlaNxsiAvt9YpV9bWubVS\n",
       "ifM2ftOaLhonWyQ9YGYHxtbNSphZi+ecxa76xwmDXv4HrFeuz9gDSgvIG8J3GWEse/4wvsvMbHCb\n",
       "m8oANS3umD/UtajhiE7h3JqZ/b+KmSqCFbpo+gCt6aJxyoCkWwjB4Ekze7mVGvsBexG6yLcFnot6\n",
       "v87MaO61PKC0nEqOSsoahcUdRwGbEb4I9yB0ge1XUWPtEEkvk0BurbVk0UXjlIcYDGoJAWEbwrpi\n",
       "RQcDSZ2BQYSh7acS8ipFLZXUotdpp38DbUr+iAvg9bynugNPm9k3KmKsBCS9AOxOCCL9Je0IXJw3\n",
       "A95pIR0pt+akR6nBQNIjwDrA3wjdXk/mljvKGh/l1TI62qgkWHlxx5dV5OKO1U7Gy2M4zko0EwwG\n",
       "tSIYPE8ISDsTZusvkPS33Dy2LPGA0gLM7H3Cml3DK+0lQ7JY3LHayc14vgw4lBVya21vx+mAlBwM\n",
       "zOw7AJK6ExYz/QNhnlSXVVRrFd7l5eQGHaxHmN39cYXttDs6Um7NSZO8YPA9oJeZtTgYSPp/hBzM\n",
       "QMKAiycJ3V6PZu3TWyhO0QshOoEUZvw7HZtmgsH1hIBQDF0Jrennyr32oLdQHKeVxFnHG9CxcmtO\n",
       "Qkg6h7BEUtmDQRZ4QHEcx3EyoZgdzBzHcRynIB5QHMdxnEzwgOI4juNkggcUx8kAST+Q9IKkf0qa\n",
       "Jqls67tJmhS3o3acpPBhw45TIpI+BxwIDDCzT+Kim5lPGsvDaKPN0BynGLyF4jil0wv4b25Yp5m9\n",
       "Z2bvSPqRpKmSpsd954FPWxi/kvR3SS9L2l3SXZJelXRRvGYrSTMk3SzpJUl/ils9L4ek/SU9I+lZ\n",
       "SbdLWieWXyLpxdhiuryN3genyvGA4jilMxHYXNIrkn4rae9Y/hszGxxnzHeTdFAsN+AjM9sdGEvY\n",
       "KfFUwvIa34pL4gBsB/zWzPoSlt0Ymf+ikjYibJz0BTMbSNhp8buxhXSYme1kZrsCF5Xrxh0nHw8o\n",
       "jlMiZvYBYSbzycC7wHhJxwP7SZos6XnCSrF986rdG3++ALxgZvPisjf/oml71rfM7G/x+GbCMuY5\n",
       "RNhyoC9h7/JpwHHAFoR15xolXSfpcJr2vHecsuI5FMfJADNbRtgI6fG4DMupQD9goJnNkXQ+eRuZ\n",
       "0bQq8bK849x57u8yP08ims+bPGxmx6xYGAcFfAE4AjgjHjtOWfEWiuOUiKTtJPXJKxoAzCAEgPmS\n",
       "1gW+3grpLSTtEY+PYfk1nAyYDOwpaZvoYx1JfWIepcbM/gJ8F9i1Fa/tOEXjLRTHKZ11gasl1QBL\n",
       "gJnAKUADoUtrLjClQN1Vjdh6BThd0vXAi4R8S1NFs/9K+hZwm6TcqLIfAIuAeyR1JbRsvtPK+3Kc\n",
       "ovC1vBwnQeK2vPf5EvhOe8K7vBwnXfy/Padd4S0Ux3EcJxO8heI4juNkggcUx3EcJxM8oDiO4ziZ\n",
       "4AHFcRzHyQQPKI7jOE4meEBxHMdxMuH/A8A9viCKa0WSAAAAAElFTkSuQmCC\n"
      ],
      "text/plain": [
       "<matplotlib.figure.Figure at 0xa5a0208>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "fdist1.plot(20, cumulative=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'CIRCUMNAVIGATION',\n",
       " u'Physiognomically',\n",
       " u'apprehensiveness',\n",
       " u'cannibalistically',\n",
       " u'characteristically',\n",
       " u'circumnavigating',\n",
       " u'circumnavigation',\n",
       " u'circumnavigations',\n",
       " u'comprehensiveness',\n",
       " u'hermaphroditical',\n",
       " u'indiscriminately',\n",
       " u'indispensableness',\n",
       " u'irresistibleness',\n",
       " u'physiognomically',\n",
       " u'preternaturalness',\n",
       " u'responsibilities',\n",
       " u'simultaneousness',\n",
       " u'subterraneousness',\n",
       " u'supernaturalness',\n",
       " u'superstitiousness',\n",
       " u'uncomfortableness',\n",
       " u'uncompromisedness',\n",
       " u'undiscriminating',\n",
       " u'uninterpenetratingly']"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# apply a list comprehension to get words over 15 characters\n",
    "V = set(text1)\n",
    "long_words = [w for w in V if len(w) > 15]\n",
    "sorted(long_words)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'#14-19teens',\n",
       " u'#talkcity_adults',\n",
       " u'((((((((((',\n",
       " u'........',\n",
       " u'Question',\n",
       " u'actually',\n",
       " u'anything',\n",
       " u'computer',\n",
       " u'cute.-ass',\n",
       " u'everyone',\n",
       " u'football',\n",
       " u'innocent',\n",
       " u'listening',\n",
       " u'remember',\n",
       " u'seriously',\n",
       " u'something',\n",
       " u'together',\n",
       " u'tomorrow',\n",
       " u'watching']"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# words in the chat corpus longer than 7 characters that occur more than 7 times\n",
    "fdist2 = FreqDist(text5)\n",
    "sorted(w for w in set(text5) if len(w) > 7 and fdist2[w] > 7)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "United States; fellow citizens; four years; years ago; Federal\n",
      "Government; General Government; American people; Vice President; Old\n",
      "World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;\n",
      "God bless; every citizen; Indian tribes; public debt; one another;\n",
      "foreign nations; political parties\n"
     ]
    }
   ],
   "source": [
    "# word sequences that appear together unusually often\n",
    "text4.collocations()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Raw Text Processing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1176896"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# download the raw text of Crime and Punishment from Project Gutenberg\n",
    "import urllib2\n",
    "url = \"http://www.gutenberg.org/files/2554/2554.txt\"\n",
    "response = urllib2.urlopen(url)\n",
    "raw = response.read().decode('utf8')\n",
    "len(raw)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "u'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\\r\\n'"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "raw[:75]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "254352"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# tokenize the raw text\n",
    "from nltk import word_tokenize\n",
    "tokens = word_tokenize(raw)\n",
    "len(tokens)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'The',\n",
       " u'Project',\n",
       " u'Gutenberg',\n",
       " u'EBook',\n",
       " u'of',\n",
       " u'Crime',\n",
       " u'and',\n",
       " u'Punishment',\n",
       " u',',\n",
       " u'by']"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokens[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'CHAPTER',\n",
       " u'I',\n",
       " u'On',\n",
       " u'an',\n",
       " u'exceptionally',\n",
       " u'hot',\n",
       " u'evening',\n",
       " u'early',\n",
       " u'in',\n",
       " u'July',\n",
       " u'a',\n",
       " u'young',\n",
       " u'man',\n",
       " u'came',\n",
       " u'out',\n",
       " u'of',\n",
       " u'the',\n",
       " u'garret',\n",
       " u'in',\n",
       " u'which',\n",
       " u'he',\n",
       " u'lodged',\n",
       " u'in',\n",
       " u'S.',\n",
       " u'Place',\n",
       " u'and',\n",
       " u'walked',\n",
       " u'slowly',\n",
       " u',',\n",
       " u'as',\n",
       " u'though',\n",
       " u'in',\n",
       " u'hesitation',\n",
       " u',',\n",
       " u'towards',\n",
       " u'K.',\n",
       " u'bridge',\n",
       " u'.']"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
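   ],
   "source": [
    "# wrap the tokens in an nltk.Text to get corpus methods such as collocations()\n",
    "text = nltk.Text(tokens)\n",
    "# this slice covers the opening sentence of Chapter I\n",
    "text[1024:1062]"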
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Katerina Ivanovna; Pyotr Petrovitch; Pulcheria Alexandrovna; Avdotya\n",
      "Romanovna; Rodion Romanovitch; Marfa Petrovna; Sofya Semyonovna; old\n",
      "woman; Project Gutenberg-tm; Porfiry Petrovitch; Amalia Ivanovna;\n",
      "great deal; Nikodim Fomitch; young man; Ilya Petrovitch; n't know;\n",
      "Project Gutenberg; Dmitri Prokofitch; Andrey Semyonovitch; Hay Market\n"
     ]
    }
   ],
   "source": [
    "text.collocations()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5338"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# character offset of the first occurrence of 'PART I' (-1 if absent)\n",
    "raw.find(\"PART I\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'BBC',\n",
       " u'NEWS',\n",
       " u'|',\n",
       " u'Health',\n",
       " u'|',\n",
       " u'Blondes',\n",
       " u\"'to\",\n",
       " u'die',\n",
       " u'out',\n",
       " u'in']"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# HTML parsing using the Beautiful Soup library\n",
    "from bs4 import BeautifulSoup\n",
    "url = \"http://news.bbc.co.uk/2/hi/health/2284783.stm\"\n",
    "html = urllib2.urlopen(url).read().decode('utf8')\n",
    "raw = BeautifulSoup(html).get_text()\n",
    "tokens = word_tokenize(raw)\n",
    "tokens[0:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Displaying 5 of 5 matches:\n",
      "hey say too few people now carry the gene for blondes to last beyond the next \n",
      "blonde hair is caused by a recessive gene . In order for a child to have blond\n",
      " have blonde hair , it must have the gene on both sides of the family in the g\n",
      "ere is a disadvantage of having that gene or by chance . They do n't disappear\n",
      "des would disappear is if having the gene was a disadvantage and I do not thin\n"
     ]
    }
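   ],
   "source": [
    "# keep only the article body; the slice bounds are specific to this page\n",
    "tokens = tokens[110:390]\n",
    "text = nltk.Text(tokens)\n",
    "# show every occurrence of 'gene' with its surrounding context\n",
    "text.concordance('gene')"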
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Regular Expressions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# regular expression library\n",
    "import re\n",
    "# keep only the all-lowercase entries of the English word list\n",
    "wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'abaissed',\n",
       " u'abandoned',\n",
       " u'abased',\n",
       " u'abashed',\n",
       " u'abatised',\n",
       " u'abed',\n",
       " u'aborted',\n",
       " u'abridged',\n",
       " u'abscessed',\n",
       " u'absconded']"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# $ anchors the pattern at the end of the word: words ending in 'ed'\n",
    "[w for w in wordlist if re.search(r'ed$', w)][0:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'abjectly',\n",
       " u'adjuster',\n",
       " u'dejected',\n",
       " u'dejectly',\n",
       " u'injector',\n",
       " u'majestic',\n",
       " u'objectee',\n",
       " u'objector',\n",
       " u'rejecter',\n",
       " u'rejector']"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# . matches any single character: 8-letter words with 'j' third and 't' sixth\n",
    "[w for w in wordlist if re.search(r'^..j..t..$', w)][0:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'gold', u'golf', u'hold', u'hole']"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# ^ and $ anchor both ends; each [...] set matches one character from that set\n",
    "# (the four sets are the letters on phone keys 4, 6, 5 and 3)\n",
    "[w for w in wordlist if re.search(r'^[ghi][mno][jlk][def]$', w)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee',\n",
       " u'miiiiiinnnnnnnnnneeeeeeee',\n",
       " u'mine',\n",
       " u'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# distinct tokens in the NPS Chat corpus\n",
    "chat_words = sorted(set(nltk.corpus.nps_chat.words()))\n",
    "\n",
    "# + matches one or more repetitions of the preceding character\n",
    "[w for w in chat_words if re.search(r'^m+i+n+e+$', w)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'0.0085',\n",
       " u'0.05',\n",
       " u'0.1',\n",
       " u'0.16',\n",
       " u'0.2',\n",
       " u'0.25',\n",
       " u'0.28',\n",
       " u'0.3',\n",
       " u'0.4',\n",
       " u'0.5']"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# distinct tokens in the Penn Treebank sample\n",
    "wsj = sorted(set(nltk.corpus.treebank.words()))\n",
    "\n",
    "# decimal numbers: digits, an escaped (literal) dot, then digits\n",
    "[w for w in wsj if re.search(r'^[0-9]+\\.[0-9]+$', w)][0:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'C$', u'US$']"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# one or more uppercase letters followed by a literal dollar sign\n",
    "[w for w in wsj if re.search(r'^[A-Z]+\\$$', w)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'1614',\n",
       " u'1637',\n",
       " u'1787',\n",
       " u'1901',\n",
       " u'1903',\n",
       " u'1917',\n",
       " u'1925',\n",
       " u'1929',\n",
       " u'1933',\n",
       " u'1934']"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# {4} requires exactly four repetitions: four-digit numbers such as years\n",
    "[w for w in wsj if re.search(r'^[0-9]{4}$', w)][0:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'10-day',\n",
       " u'10-lap',\n",
       " u'10-year',\n",
       " u'100-share',\n",
       " u'12-point',\n",
       " u'12-year',\n",
       " u'14-hour',\n",
       " u'15-day',\n",
       " u'150-point',\n",
       " u'190-point']"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# a number, a hyphen, then a lowercase word of 3 to 5 letters\n",
    "[w for w in wsj if re.search(r'^[0-9]+-[a-z]{3,5}$', w)][0:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'black-and-white',\n",
       " u'bread-and-butter',\n",
       " u'father-in-law',\n",
       " u'machine-gun-toting',\n",
       " u'savings-and-loan']"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[w for w in wsj if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', w)][0:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'62%-owned',\n",
       " u'Absorbed',\n",
       " u'According',\n",
       " u'Adopting',\n",
       " u'Advanced',\n",
       " u'Advancing',\n",
       " u'Alfred',\n",
       " u'Allied',\n",
       " u'Annualized',\n",
       " u'Anything']"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[w for w in wsj if re.search('(ed|ing)$', w)][0:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(u'io', 549),\n",
       " (u'ea', 476),\n",
       " (u'ie', 331),\n",
       " (u'ou', 329),\n",
       " (u'ai', 261),\n",
       " (u'ia', 253),\n",
       " (u'ee', 217),\n",
       " (u'oo', 174),\n",
       " (u'ua', 109),\n",
       " (u'au', 106),\n",
       " (u'ue', 105),\n",
       " (u'ui', 95)]"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# using \"findall\" to extract partial matches from words\n",
    "fd = nltk.FreqDist(vs for word in wsj \n",
    "                      for vs in re.findall(r'[aeiou]{2,}', word))\n",
    "fd.most_common(12)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Normalizing Text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# NLTK has several word stemmers built in\n",
    "porter = nltk.PorterStemmer()\n",
    "lancaster = nltk.LancasterStemmer()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'UK',\n",
       " u'Blond',\n",
       " u\"'to\",\n",
       " u'die',\n",
       " u'out',\n",
       " u'in',\n",
       " u'200',\n",
       " u\"years'\",\n",
       " u'Scientist',\n",
       " u'believ']"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[porter.stem(t) for t in tokens][0:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'uk',\n",
       " u'blond',\n",
       " u\"'to\",\n",
       " u'die',\n",
       " u'out',\n",
       " u'in',\n",
       " u'200',\n",
       " u\"years'\",\n",
       " u'sci',\n",
       " u'believ']"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[lancaster.stem(t) for t in tokens][0:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'UK',\n",
       " u'Blondes',\n",
       " u\"'to\",\n",
       " u'die',\n",
       " u'out',\n",
       " u'in',\n",
       " u'200',\n",
       " u\"years'\",\n",
       " u'Scientists',\n",
       " u'believe']"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wnl = nltk.WordNetLemmatizer()\n",
    "[wnl.lemmatize(t) for t in tokens][0:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# NLTK also has a tokenizer that takes a regular expression as a parameter\n",
    "text = 'That U.S.A. poster-print costs $12.40...'\n",
    "pattern = r'''(?x)        # set flag to allow verbose regexps\n",
    "     (?:[A-Z]\\.)+          # abbreviations, e.g. U.S.A.\n",
    "   | \\w+(?:-\\w+)*          # words with optional internal hyphens\n",
    "   | \\$?\\d+(?:\\.\\d+)?%?    # currency and percentages, e.g. $12.40, 82%\n",
    "   | \\.\\.\\.                # ellipsis\n",
    "   | [][.,;\"'?():_`-]      # these are separate tokens; includes ], [ (hyphen last so it is literal)\n",
    "'''\n",
    "nltk.regexp_tokenize(text, pattern)"
   ]
  },
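  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Why grouping matters in a tokenizer pattern: `re.findall` returns the contents of capturing groups rather than the whole match, so a pattern written with `(...)` can yield fragments instead of tokens wherever a tokenizer relies on `findall` (whether `nltk.regexp_tokenize` is affected depends on the NLTK version). A minimal sketch using only the standard `re` module; the sample string is made up for illustration:\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "sample = 'poster-print costs'\n",
    "\n",
    "# Capturing group: findall returns only what the group matched\n",
    "capturing = re.findall(r'\\w+(-\\w+)*', sample)\n",
    "\n",
    "# Non-capturing group (?:...): findall returns the whole match\n",
    "non_capturing = re.findall(r'\\w+(?:-\\w+)*', sample)\n",
    "\n",
    "print(capturing)      # group contents, not tokens\n",
    "print(non_capturing)  # ['poster-print', 'costs']\n",
    "```\n",
    "\n",
    "Writing groups as `(?:...)` gives the same tokens whether or not the tokenizer rewrites groups internally."
   ]
  },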
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Tagging"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('They', 'PRP'),\n",
       " ('refuse', 'VBP'),\n",
       " ('to', 'TO'),\n",
       " ('permit', 'VB'),\n",
       " ('us', 'PRP'),\n",
       " ('to', 'TO'),\n",
       " ('obtain', 'VB'),\n",
       " ('the', 'DT'),\n",
       " ('refuse', 'NN'),\n",
       " ('permit', 'NN')]"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Use a built-in tokenizer and tagger\n",
    "text = word_tokenize(\"They refuse to permit us to obtain the refuse permit\")\n",
    "nltk.pos_tag(text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "man time day year car moment world family house country child boy\n",
      "state job way war girl place word work\n"
     ]
    }
   ],
   "source": [
    "# Find words that occur in contexts similar to those of 'woman' (distributional similarity)\n",
    "text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())\n",
    "text.similar('woman')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(u'The', u'AT'),\n",
       " (u'Fulton', u'NP-TL'),\n",
       " (u'County', u'NN-TL'),\n",
       " (u'Grand', u'JJ-TL'),\n",
       " (u'Jury', u'NN-TL'),\n",
       " (u'said', u'VBD'),\n",
       " (u'Friday', u'NR'),\n",
       " (u'an', u'AT'),\n",
       " (u'investigation', u'NN'),\n",
       " (u'of', u'IN')]"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Tagged words are saved as tuples\n",
    "nltk.corpus.brown.tagged_words()[0:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(u'The', u'DET'),\n",
       " (u'Fulton', u'NOUN'),\n",
       " (u'County', u'NOUN'),\n",
       " (u'Grand', u'ADJ'),\n",
       " (u'Jury', u'NOUN'),\n",
       " (u'said', u'VERB'),\n",
       " (u'Friday', u'NOUN'),\n",
       " (u'an', u'DET'),\n",
       " (u'investigation', u'NOUN'),\n",
       " (u'of', u'ADP')]"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nltk.corpus.brown.tagged_words(tagset='universal')[0:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(u'NOUN', 30640),\n",
       " (u'VERB', 14399),\n",
       " (u'ADP', 12355),\n",
       " (u'.', 11928),\n",
       " (u'DET', 11389),\n",
       " (u'ADJ', 6706),\n",
       " (u'ADV', 3349),\n",
       " (u'CONJ', 2717),\n",
       " (u'PRON', 2535),\n",
       " (u'PRT', 2264),\n",
       " (u'NUM', 2166),\n",
       " (u'X', 106)]"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from nltk.corpus import brown\n",
    "brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')\n",
    "tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)\n",
    "tag_fd.most_common()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "VERB  ADV  ADP  ADJ    .  PRT \n",
      "  37    8    7    6    4    2 \n"
     ]
    }
   ],
   "source": [
    "# Part of speech tag count for words following \"often\" in a text\n",
    "brown_lrnd_tagged = brown.tagged_words(categories='learned', tagset='universal')\n",
    "tags = [b[1] for (a, b) in nltk.bigrams(brown_lrnd_tagged) if a[0] == 'often']\n",
    "fd = nltk.FreqDist(tags)\n",
    "fd.tabulate()"
   ]
  },
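  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The bigram filter above pairs each tagged word with its successor and keeps the successor's tag when the first word is `often`. The same logic can be sketched without NLTK, assuming a small hand-made list of `(word, tag)` tuples (hypothetical data, not taken from the Brown corpus):\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "\n",
    "# Hypothetical tagged words for illustration\n",
    "tagged = [('she', 'PRON'), ('often', 'ADV'), ('reads', 'VERB'),\n",
    "          ('and', 'CONJ'), ('often', 'ADV'), ('writes', 'VERB'),\n",
    "          ('often', 'ADV'), ('late', 'ADV')]\n",
    "\n",
    "# zip pairs every item with the next one, which is what nltk.bigrams yields\n",
    "pairs = list(zip(tagged, tagged[1:]))\n",
    "\n",
    "# a and b are (word, tag) tuples: a[0] is the first word, b[1] the next tag\n",
    "following = Counter(b[1] for (a, b) in pairs if a[0] == 'often')\n",
    "\n",
    "print(following.most_common())  # [('VERB', 2), ('ADV', 1)]\n",
    "```"
   ]
  },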
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Load tagged and untagged sentences from the news category\n",
    "from nltk.corpus import brown\n",
    "brown_tagged_sents = brown.tagged_sents(categories='news')\n",
    "brown_sents = brown.sents(categories='news')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('I', 'NN'),\n",
       " ('do', 'NN'),\n",
       " ('not', 'NN'),\n",
       " ('like', 'NN'),\n",
       " ('green', 'NN'),\n",
       " ('eggs', 'NN'),\n",
       " ('and', 'NN'),\n",
       " ('ham', 'NN'),\n",
       " (',', 'NN'),\n",
       " ('I', 'NN'),\n",
       " ('do', 'NN'),\n",
       " ('not', 'NN'),\n",
       " ('like', 'NN'),\n",
       " ('them', 'NN'),\n",
       " ('Sam', 'NN'),\n",
       " ('I', 'NN'),\n",
       " ('am', 'NN'),\n",
       " ('!', 'NN')]"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Default tagger (assigns same tag to each token)\n",
    "tags = [tag for (word, tag) in brown.tagged_words(categories='news')]\n",
    "nltk.FreqDist(tags).max()  # most frequent tag in the news category ('NN'), used as the default below\n",
    "raw = 'I do not like green eggs and ham, I do not like them Sam I am!'\n",
    "tokens = word_tokenize(raw)\n",
    "default_tagger = nltk.DefaultTagger('NN')\n",
    "default_tagger.tag(tokens)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.13089484257215028"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Evaluate the performance against a tagged corpus\n",
    "default_tagger.evaluate(brown_tagged_sents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(u'Various', u'JJ'),\n",
       " (u'of', u'IN'),\n",
       " (u'the', u'AT'),\n",
       " (u'apartments', u'NNS'),\n",
       " (u'are', u'BER'),\n",
       " (u'of', u'IN'),\n",
       " (u'the', u'AT'),\n",
       " (u'terrace', u'NN'),\n",
       " (u'type', u'NN'),\n",
       " (u',', u','),\n",
       " (u'being', u'BEG'),\n",
       " (u'on', u'IN'),\n",
       " (u'the', u'AT'),\n",
       " (u'ground', u'NN'),\n",
       " (u'floor', u'NN'),\n",
       " (u'so', u'QL'),\n",
       " (u'that', u'CS'),\n",
       " (u'entrance', u'NN'),\n",
       " (u'is', u'BEZ'),\n",
       " (u'direct', u'JJ'),\n",
       " (u'.', u'.')]"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Training a unigram tagger\n",
    "from nltk.corpus import brown\n",
    "brown_tagged_sents = brown.tagged_sents(categories='news')\n",
    "brown_sents = brown.sents(categories='news')\n",
    "unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)\n",
    "unigram_tagger.tag(brown_sents[2007])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9349006503968017"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Now evaluate it (on its own training data, so the score is optimistic)\n",
    "unigram_tagger.evaluate(brown_tagged_sents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9730592517453309"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Combining taggers with backoff (scored here on the training data)\n",
    "t0 = nltk.DefaultTagger('NN')\n",
    "t1 = nltk.UnigramTagger(brown_tagged_sents, backoff=t0)\n",
    "t2 = nltk.BigramTagger(brown_tagged_sents, backoff=t1)\n",
    "t2.evaluate(brown_tagged_sents)"
   ]
  },
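  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The scores above are measured on the sentences the taggers were trained on. A held-out split gives a more honest estimate. The sketch below rebuilds the same backoff chain and scores it on unseen sentences; the helper names `train_backoff_tagger` and `token_accuracy` and the toy sentences are assumptions made for illustration, and accuracy is computed by hand rather than through any particular NLTK evaluation method:\n",
    "\n",
    "```python\n",
    "import nltk\n",
    "\n",
    "def train_backoff_tagger(train_sents):\n",
    "    # Same backoff chain as above: bigram -> unigram -> default 'NN'\n",
    "    t0 = nltk.DefaultTagger('NN')\n",
    "    t1 = nltk.UnigramTagger(train_sents, backoff=t0)\n",
    "    return nltk.BigramTagger(train_sents, backoff=t1)\n",
    "\n",
    "def token_accuracy(tagger, gold_sents):\n",
    "    # Fraction of tokens whose predicted tag equals the gold tag\n",
    "    correct = total = 0\n",
    "    for sent in gold_sents:\n",
    "        words = [w for (w, _) in sent]\n",
    "        for (_, predicted), (_, gold) in zip(tagger.tag(words), sent):\n",
    "            correct += (predicted == gold)\n",
    "            total += 1\n",
    "    return correct / float(total)\n",
    "\n",
    "# Toy tagged sentences (hypothetical data, not from the Brown corpus)\n",
    "toy_sents = [\n",
    "    [('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')],\n",
    "    [('the', 'DT'), ('cat', 'NN'), ('sleeps', 'VBZ')],\n",
    "    [('a', 'DT'), ('dog', 'NN'), ('sleeps', 'VBZ')],\n",
    "    [('the', 'DT'), ('bird', 'NN'), ('sings', 'VBZ')],\n",
    "]\n",
    "\n",
    "# Train on the first 75% of sentences, test on the rest\n",
    "cut = int(len(toy_sents) * 0.75)\n",
    "train_sents, test_sents = toy_sents[:cut], toy_sents[cut:]\n",
    "\n",
    "tagger = train_backoff_tagger(train_sents)\n",
    "print(tagger.tag(['the', 'bird', 'sings']))\n",
    "print(token_accuracy(tagger, test_sents))\n",
    "```\n",
    "\n",
    "With the Brown data loaded earlier, the same split would be `brown_tagged_sents[:cut]` and `brown_tagged_sents[cut:]` with, for example, `cut = int(len(brown_tagged_sents) * 0.9)`. Unseen words such as `sings` fall through to the default tagger, which is why held-out accuracy is usually lower than the training-set figure."
   ]
  },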
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Classifying Text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'last_letter': 'k'}"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Define a feature extractor\n",
    "def gender_features(word):\n",
    "    return {'last_letter': word[-1]}\n",
    "gender_features('Shrek')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Prepare a list of examples\n",
    "from nltk.corpus import names\n",
    "labeled_names = ([(name, 'male') for name in names.words('male.txt')] +\n",
    "    [(name, 'female') for name in names.words('female.txt')])\n",
    "import random\n",
    "random.shuffle(labeled_names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Process the names data\n",
    "featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]\n",
    "train_set, test_set = featuresets[500:], featuresets[:500]\n",
    "classifier = nltk.NaiveBayesClassifier.train(train_set)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'male'"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "classifier.classify(gender_features('Neo'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'female'"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "classifier.classify(gender_features('Trinity'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.752\n"
     ]
    }
   ],
   "source": [
    "print(nltk.classify.accuracy(classifier, test_set))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Most Informative Features\n",
      "             last_letter = u'a'           female : male   =     35.4 : 1.0\n",
      "             last_letter = u'k'             male : female =     31.9 : 1.0\n",
      "             last_letter = u'f'             male : female =     17.4 : 1.0\n",
      "             last_letter = u'p'             male : female =     11.3 : 1.0\n",
      "             last_letter = u'm'             male : female =     10.2 : 1.0\n"
     ]
    }
   ],
   "source": [
    "classifier.show_most_informative_features(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Document classification\n",
    "from nltk.corpus import movie_reviews\n",
    "documents = [(list(movie_reviews.words(fileid)), category)\n",
    "             for category in movie_reviews.categories()\n",
    "             for fileid in movie_reviews.fileids(category)]\n",
    "random.shuffle(documents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())\n",
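    "# keep the 2000 most frequent words; keys() is not ordered by frequency in NLTK 3\n",
    "word_features = [w for (w, _) in all_words.most_common(2000)]\n",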
    "\n",
    "def document_features(document):\n",
    "    document_words = set(document)\n",
    "    features = {}\n",
    "    for word in word_features:\n",
    "        features['contains(%s)' % word] = (word in document_words)\n",
    "    return features"
   ]
  },
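  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Choosing features by frequency can be sketched with `collections.Counter`, which `nltk.FreqDist` subclasses in NLTK 3. The helper `top_words`, the two-argument `document_features`, and the tiny word list are assumptions for illustration only:\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "\n",
    "def top_words(words, n):\n",
    "    # most_common(n) returns (word, count) pairs sorted by count, highest first\n",
    "    counts = Counter(w.lower() for w in words)\n",
    "    return [w for (w, _) in counts.most_common(n)]\n",
    "\n",
    "def document_features(document, word_features):\n",
    "    # One boolean feature per selected word: is it present in the document?\n",
    "    document_words = set(document)\n",
    "    return {'contains(%s)' % w: (w in document_words) for w in word_features}\n",
    "\n",
    "# Hypothetical miniature corpus\n",
    "corpus = ['good', 'film', 'bad', 'film', 'good', 'Good', 'plot']\n",
    "word_features = top_words(corpus, 2)\n",
    "features = document_features(['a', 'good', 'plot'], word_features)\n",
    "\n",
    "print(word_features)  # ['good', 'film']\n",
    "print(features)\n",
    "```\n",
    "\n",
    "Selecting with `most_common` keeps the words most likely to appear in any given review, so fewer features are `False` for every document."
   ]
  },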
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "featuresets = [(document_features(d), c) for (d,c) in documents]\n",
    "train_set, test_set = featuresets[100:], featuresets[:100]\n",
    "classifier = nltk.NaiveBayesClassifier.train(train_set)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.64\n"
     ]
    }
   ],
   "source": [
    "print(nltk.classify.accuracy(classifier, test_set))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Most Informative Features\n",
      "          contains(sans) = True              neg : pos    =      8.4 : 1.0\n",
      "     contains(uplifting) = True              pos : neg    =      8.2 : 1.0\n",
      "    contains(mediocrity) = True              neg : pos    =      7.7 : 1.0\n",
      "     contains(dismissed) = True              pos : neg    =      7.0 : 1.0\n",
      "   contains(overwhelmed) = True              pos : neg    =      6.3 : 1.0\n"
     ]
    }
   ],
   "source": [
    "classifier.show_most_informative_features(5)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
